NSF Forms

02 INFORMATION ABOUT PRINCIPAL INVESTIGATORS/PROJECT DIRECTORS(PI/PD) and co-PRINCIPAL INVESTIGATORS/co-PROJECT DIRECTORS Submit only ONE copy of this form for each PI/PD and co-PI/PD identified on the proposal. The form(s) should be attached to the original proposal as specified in GPG Section II.C.a. Submission of this information is voluntary and is not a precondition of award. This information will not be disclosed to external peer reviewers. DO NOT INCLUDE THIS FORM WITH ANY OF THE OTHER COPIES OF YOUR PROPOSAL AS THIS MAY COMPROMISE THE CONFIDENTIALITY OF THE INFORMATION. PI/PD Name:

Vicente M Reyes

Gender:

Male

Female

Ethnicity: (Choose one response)

Hispanic or Latino

Race: (Select one or more)

American Indian or Alaska Native

Not Hispanic or Latino

Asian Black or African American Native Hawaiian or Other Pacific Islander White

Disability Status: (Select one or more)

Hearing Impairment Visual Impairment Mobility/Orthopedic Impairment Other None

Citizenship:

(Choose one)

U.S. Citizen

Permanent Resident

Other non-U.S. Citizen

Check here if you do not wish to provide any or all of the above information (excluding PI/PD name):
REQUIRED: Check here if you are currently serving (or have previously served) as a PI, co-PI or PD on any federally funded project
Ethnicity Definition: Hispanic or Latino. A person of Mexican, Puerto Rican, Cuban, South or Central American, or other Spanish culture or origin, regardless of race.
Race Definitions: American Indian or Alaska Native. A person having origins in any of the original peoples of North and South America (including Central America), and who maintains tribal affiliation or community attachment. Asian. A person having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam. Black or African American. A person having origins in any of the black racial groups of Africa. Native Hawaiian or Other Pacific Islander. A person having origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands. White. A person having origins in any of the original peoples of Europe, the Middle East, or North Africa.
WHY THIS INFORMATION IS BEING REQUESTED: The Federal Government has a continuing commitment to monitor the operation of its review and award processes to identify and address any inequities based on gender, race, ethnicity, or disability of its proposed PIs/PDs. To gather information needed for this important task, the proposer should submit a single copy of this form for each identified PI/PD with each proposal. Submission of the requested information is voluntary and will not affect the organization’s eligibility for an award. However, information not submitted will seriously undermine the statistical validity, and therefore the usefulness, of information received from others.
Any individual not wishing to submit some or all of the information should check the box provided for this purpose. (The exceptions are the PI/PD name and the information about prior Federal support, the last question above.) Collection of this information is authorized by the NSF Act of 1950, as amended, 42 U.S.C. 1861, et seq. Demographic data allows NSF to gauge whether our programs and other opportunities in science and technology are fairly reaching and benefiting everyone regardless of demographic category; to ensure that those in under-represented groups have the same knowledge of and access to programs and other research and educational opportunities; and to assess involvement of international investigators in work supported by NSF. The information may be disclosed to government contractors, experts, volunteers and researchers to complete assigned work; and to other government agencies in order to coordinate and assess programs. The information may be added to the Reviewer file and used to select potential candidates to serve as peer reviewers or advisory committee members. See Systems of Records, NSF-50, "Principal Investigator/Proposal File and Associated Records", 63 Federal Register 267 (January 5, 1998), and NSF-51, "Reviewer/Proposal File and Associated Records", 63 Federal Register 268 (January 5, 1998).

COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION PROGRAM ANNOUNCEMENT/SOLICITATION NO./CLOSING DATE/if not in response to a program announcement/solicitation enter NSF 10-1

NSF 10-567

FOR NSF USE ONLY

NSF PROPOSAL NUMBER

08/23/10

FOR CONSIDERATION BY NSF ORGANIZATION UNIT(S)

(Indicate the most specific unit known, i.e. program, division, etc.)

DBI - ADVANCES IN BIO INFORMATICS DATE RECEIVED NUMBER OF COPIES DIVISION ASSIGNED FUND CODE DUNS# (Data Universal Numbering System)

FILE LOCATION

002223642 EMPLOYER IDENTIFICATION NUMBER (EIN) OR TAXPAYER IDENTIFICATION NUMBER (TIN)

IS THIS PROPOSAL BEING SUBMITTED TO ANOTHER FEDERAL AGENCY? YES NO IF YES, LIST ACRONYM(S)

SHOW PREVIOUS AWARD NO. IF THIS IS A RENEWAL AN ACCOMPLISHMENT-BASED RENEWAL

160743140 NAME OF ORGANIZATION TO WHICH AWARD SHOULD BE MADE

ADDRESS OF AWARDEE ORGANIZATION, INCLUDING 9 DIGIT ZIP CODE

Rochester Institute of Tech 1 Lomb Memorial Drive Rochester, NY. 146235603

Rochester Institute of Tech AWARDEE ORGANIZATION CODE (IF KNOWN)

0028068000 NAME OF PERFORMING ORGANIZATION, IF DIFFERENT FROM ABOVE

ADDRESS OF PERFORMING ORGANIZATION, IF DIFFERENT, INCLUDING 9 DIGIT ZIP CODE

PERFORMING ORGANIZATION CODE (IF KNOWN)

IS AWARDEE ORGANIZATION (Check All That Apply) (See GPG II.C For Definitions) TITLE OF PROPOSED PROJECT

MINORITY BUSINESS IF THIS IS A PRELIMINARY PROPOSAL WOMAN-OWNED BUSINESS THEN CHECK HERE

ABI Innovation: Use of a Reduced Protein Representation for the Modeling and Screening of Ligand Binding Sites: A Structure-Based Protein Function Prediction Method

REQUESTED AMOUNT

PROPOSED DURATION (1-60 MONTHS)

496,892

$

SMALL BUSINESS FOR-PROFIT ORGANIZATION

36

REQUESTED STARTING DATE

09/01/11

months

SHOW RELATED PRELIMINARY PROPOSAL NO. IF APPLICABLE

CHECK APPROPRIATE BOX(ES) IF THIS PROPOSAL INCLUDES ANY OF THE ITEMS LISTED BELOW BEGINNING INVESTIGATOR (GPG I.G.2) HUMAN SUBJECTS (GPG II.D.7) Human Subjects Assurance Number DISCLOSURE OF LOBBYING ACTIVITIES (GPG II.C.1.e)

Exemption Subsection

PROPRIETARY & PRIVILEGED INFORMATION (GPG I.D, II.C.1.d)

INTERNATIONAL COOPERATIVE ACTIVITIES: COUNTRY/COUNTRIES INVOLVED

HISTORIC PLACES (GPG II.C.2.j)

(GPG II.C.2.j)

EAGER* (GPG II.D.2)

RAPID** (GPG II.D.1)

VERTEBRATE ANIMALS (GPG II.D.6) IACUC App. Date

HIGH RESOLUTION GRAPHICS/OTHER GRAPHICS WHERE EXACT COLOR REPRESENTATION IS REQUIRED FOR PROPER INTERPRETATION (GPG I.G.1)

PHS Animal Welfare Assurance Number PI/PD DEPARTMENT

PI/PD POSTAL ADDRESS

1 LOMB MEMORIAL DR

Biological Sciences PI/PD FAX NUMBER NAMES (TYPED)

or IRB App. Date

ROCHESTER, NY 146235603 United States High Degree

Yr of Degree

Telephone Number

PhD

1988

585-475-4115

Electronic Mail Address

PI/PD NAME

Vicente M Reyes CO-PI/PD

CO-PI/PD

CO-PI/PD

CO-PI/PD


[email protected]

CERTIFICATION PAGE Certification for Authorized Organizational Representative or Individual Applicant: By signing and submitting this proposal, the Authorized Organizational Representative or Individual Applicant is: (1) certifying that statements made herein are true and complete to the best of his/her knowledge; and (2) agreeing to accept the obligation to comply with NSF award terms and conditions if an award is made as a result of this application. Further, the applicant is hereby providing certifications regarding debarment and suspension, drug-free workplace, lobbying activities (see below), responsible conduct of research, nondiscrimination, and flood hazard insurance (when applicable) as set forth in the NSF Proposal & Award Policies & Procedures Guide, Part I: the Grant Proposal Guide (GPG) (NSF 10-1). Willful provision of false information in this application and its supporting documents or in reports required under an ensuing award is a criminal offense (U. S. Code, Title 18, Section 1001).

Conflict of Interest Certification In addition, if the applicant institution employs more than fifty persons, by electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative of the applicant institution is certifying that the institution has implemented a written and enforced conflict of interest policy that is consistent with the provisions of the NSF Proposal & Award Policies & Procedures Guide, Part II, Award & Administration Guide (AAG) Chapter IV.A; that to the best of his/her knowledge, all financial disclosures required by that conflict of interest policy have been made; and that all identified conflicts of interest will have been satisfactorily managed, reduced or eliminated prior to the institution’s expenditure of any funds under the award, in accordance with the institution’s conflict of interest policy. Conflicts which cannot be satisfactorily managed, reduced or eliminated must be disclosed to NSF.

Drug Free Work Place Certification By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Drug Free Work Place Certification contained in Exhibit II-3 of the Grant Proposal Guide.

Debarment and Suspension Certification

(If answer "yes", please provide explanation.)

Is the organization or its principals presently debarred, suspended, proposed for debarment, declared ineligible, or voluntarily excluded from covered transactions by any Federal department or agency?

Yes

No

By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Debarment and Suspension Certification contained in Exhibit II-4 of the Grant Proposal Guide.

Certification Regarding Lobbying The following certification is required for an award of a Federal contract, grant, or cooperative agreement exceeding $100,000 and for an award of a Federal loan or a commitment providing for the United States to insure or guarantee a loan exceeding $150,000.

Certification for Contracts, Grants, Loans and Cooperative Agreements The undersigned certifies, to the best of his or her knowledge and belief, that: (1) No federal appropriated funds have been paid or will be paid, by or on behalf of the undersigned, to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with the awarding of any federal contract, the making of any Federal grant, the making of any Federal loan, the entering into of any cooperative agreement, and the extension, continuation, renewal, amendment, or modification of any Federal contract, grant, loan, or cooperative agreement. (2) If any funds other than Federal appropriated funds have been paid or will be paid to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with this Federal contract, grant, loan, or cooperative agreement, the undersigned shall complete and submit Standard Form-LLL, ‘‘Disclosure of Lobbying Activities,’’ in accordance with its instructions. (3) The undersigned shall require that the language of this certification be included in the award documents for all subawards at all tiers including subcontracts, subgrants, and contracts under grants, loans, and cooperative agreements and that all subrecipients shall certify and disclose accordingly. This certification is a material representation of fact upon which reliance was placed when this transaction was made or entered into. Submission of this certification is a prerequisite for making or entering into this transaction imposed by section 1352, Title 31, U.S. Code. Any person who fails to file the required certification shall be subject to a civil penalty of not less than $10,000 and not more than $100,000 for each such failure.

Certification Regarding Nondiscrimination By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative is providing the Certification Regarding Nondiscrimination contained in Exhibit II-6 of the Grant Proposal Guide.

Certification Regarding Flood Hazard Insurance Two sections of the National Flood Insurance Act of 1968 (42 USC §4012a and §4106) bar Federal agencies from giving financial assistance for acquisition or construction purposes in any area identified by the Federal Emergency Management Agency (FEMA) as having special flood hazards unless the: (1) community in which that area is located participates in the national flood insurance program; and (2) building (and any related equipment) is covered by adequate flood insurance. By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant located in FEMA-designated special flood hazard areas is certifying that adequate flood insurance has been or will be obtained in the following situations: (1) for NSF grants for the construction of a building or facility, regardless of the dollar amount of the grant; and (2) for other NSF Grants when more than $25,000 has been budgeted in the proposal for repair, alteration or improvement (construction) of a building or facility.

Certification Regarding Responsible Conduct of Research (RCR) (This certification is not applicable to proposals for conferences, symposia, and workshops.) By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative of the applicant institution is certifying that, in accordance with the NSF Proposal & Award Policies & Procedures Guide, Part II, Award & Administration Guide (AAG) Chapter IV.B., the institution has a plan in place to provide appropriate training and oversight in the responsible and ethical conduct of research to undergraduates, graduate students and postdoctoral researchers who will be supported by NSF to conduct research. The undersigned shall require that the language of this certification be included in any award documents for all subawards at all tiers. AUTHORIZED ORGANIZATIONAL REPRESENTATIVE

SIGNATURE

DATE

NAME

TELEPHONE NUMBER

ELECTRONIC MAIL ADDRESS

FAX NUMBER

fm1207rrs-07

* EAGER - EArly-concept Grants for Exploratory Research
** RAPID - Grants for Rapid Response Research

Directorate for Biological Sciences Division of Biological Infrastructure Advances in Bio Informatics Proposal Classification Form PI: Reyes, Vicente CATEGORY I: INVESTIGATOR STATUS (Select ONE) Beginning Investigator - No previous Federal support as PI or Co-PI, excluding fellowships, dissertations, planning grants, etc. Prior Federal support only Current Federal support only Current & prior Federal support

CATEGORY II: FIELDS OF SCIENCE OTHER THAN BIOLOGY INVOLVED IN THIS RESEARCH (Select 1 to 3) Astronomy Chemistry Computer Science Earth Science

Engineering Mathematics Physics

Psychology Social Sciences None of the Above

CATEGORY III: SUBSTANTIVE AREA (Select 1 to 4) BIOMATERIALS BIOTECHNOLOGY Animal Biotechnology Plant Biotechnology Environmental Biotechnology Marine Biotechnology Metabolic Engineering CHROMOSOME STUDIES COMMUNITY ECOLOGY COMPUTATIONAL BIOLOGY CONSERVATION & RESTORATION BIOLOGY CORAL REEFS CURATION DATABASES ECOSYSTEMS LEVEL GENOMICS (Genome sequence, organization, function)

Viral Microbial Fungal Plant Animal

INFORMATICS MARINE MAMMALS Molecular Evolution Methodology/Theory Gene/Genome Mapping Natural Products NANOSCIENCE PHOTOSYNTHESIS PLANT BIOLOGY Arabidopsis-Related Plant Research POPULATION DYNAMICS & LIFE HISTORY

POPULATION GENETICS & BREEDING SYSTEMS REPRODUCTIVE ANIMAL BIOLOGY Plant Pathology Coevolution Biological Control STATISTICS & MODELING Methods/ Instrumentation/ Software Modeling (general) Modeling of Biological or Molecular Systems Computational Modeling

Statistics (general) STRUCTURAL BIOLOGY SYSTEMATICS Phenetics/Cladistics/ Numerical Taxonomy NONE OF THE ABOVE

CATEGORY IV: INFRASTRUCTURE (Select 1 to 3) COLLECTIONS/STOCK CULTURES Collection Enhancement Collection Refurbishment Living Organism Stock Cultures Natural History Collections DATABASES Database Initiation

Database Enhancement Database Maintenance & Curation Database Methods FACILITIES Controlled Environment Facilities Field Stations Field Facility Structure

Field Facility Equipment

LTER Site GENOME SEQUENCING Other Plant Genome Sequencing

INDUSTRY PARTICIPATION INSTRUMENTATION Instrument Development


Instrument Acquisition Computational Hardware Development/Acquisition TOOLS DEVELOPMENT Analytical Algorithm Development Other Software Development Informatics Tool Development

Technique Development TRACKING SYSTEMS Geographic Information Systems Remote Sensing TRAINING

Multi-, Cross-, Interdisciplinary Training Undergraduate Training Predoctoral Training Postdoctoral Training NONE OF THE ABOVE

CATEGORY V: HABITAT (No selection required) CATEGORY VI: GEOGRAPHIC AREA OF THE RESEARCH (No selection required) CATEGORY VII: CLASSIFICATION OF ORGANISMS (Select 1 to 4) VIRUSES

PLANTS

Bacterial

NON-VASCULAR PLANTS

Plant

VASCULAR PLANTS

Animal

PROKARYOTES

GYMNOSPERMS ANGIOSPERMS

Archaebacteria

Monocots

Cyanobacteria

Dicots

Eubacteria

PROTISTA (PROTOZOA) FUNGI LICHENS SLIME MOLDS ALGAE

ANIMALS INVERTEBRATES ARTHROPODA Hexapoda (Insecta) (Insects) VERTEBRATES FISHES

Chondrichthyes (Cartilaginous Fishes) (Sharks, Rays, Ratfish) Osteichthyes (Bony Fishes) AMPHIBIA REPTILIA AVES (Birds) MAMMALIA Primates Humans Rodentia Marine Mammals (Seals, Walrus, Whales, Otters, Dolphins, Porpoises)

TRANSGENIC ORGANISMS NO ORGANISMS

CATEGORY VIII: MODEL ORGANISM (Select ONE) NO MODEL ORGANISM MODEL ORGANISM (Choose from the list or input up to 9 characters)

FUNGAL PLANT Mouse-Ear Cress (Arabidopsis thaliana)

Fruitfly (Drosophila melanogaster)

[Enter your own model organism - up to 9 characters]

Escherichia coli


Abstract. This proposal has two parts: (a.) the development of a reduced protein representation and its implementation in a web server; and (b.) the use of the reduced protein representation in modeling the binding site of a given ligand and screening for that model in other protein 3D structures, which makes it a structure-based protein function prediction method. Current methods of reduced protein 3D structure representation, such as the Cα trace method, not only lack essential molecular detail but also ignore the chemical properties of the component amino acid side chains. We describe a reduced protein 3D structure representation called the "double-centroid reduced representation" (DCRR) and present a visualization tool called the "DCRR Web Server" (http://tortellini.bioinformatics.rit.edu/vns4483/dcrr.php) that graphically displays a protein 3D structure in DCRR along with non-covalent intra- and intermolecular hydrogen bonding and van der Waals interactions (including ordered water molecules). In the DCRR model, each amino acid residue is represented by two points: the centroid of the backbone atoms and that of the side chain atoms; in the visualization web server, these points and the non-bonded interactions are color-coded for easy identification. Our visualization tool is implemented in MATLAB and, to our knowledge, is the first for any reduced protein representation; it also simultaneously displays the non-covalent interactions in the molecule. The DCRR model reduces the atomicity of the protein structure by ~75% while capturing the essential chemical properties of the amino acids, and therefore of the protein as a whole.

In the second half, we describe the application of DCRR to the modeling and screening of ligand binding sites using a data model we term the "tetrahedral motif," which consists of the four amino acid centroids (side chain or backbone) in the protein that form the four most dominant interactions (H-bond or VDW) with the ligand. The dependence of our ligand binding site modeling and screening on the reduced representation gives it unique advantages over current methods, among which are ease, simplicity, speed and high throughput of screening. Here we describe the entire modeling and screening procedure, which is implemented in Fortran 77/90, and show ample preliminary results. We are currently extending the method to incorporate nucleic acids (DNA and RNA) in a reduced representation called the "triple-centroid reduced representation" (TCRR), where each nucleotide is represented by three points, namely the centroids of the base, sugar and phosphate atoms.

Intellectual Merit. This work offers considerable opportunity to advance knowledge across different domains, since it combines mathematics, computing, molecular biology and proteomics. The PI is a multidisciplinary researcher with formal training in mathematics, chemistry, biochemistry, molecular biology, statistics, data mining, bioinformatics and computational biology. Part of this work proposes reduced representations not only for proteins ("DCRR", double-centroid reduced representation) but also for DNA and RNA ("TCRR", triple-centroid reduced representation) - these have the potential to transform current work on all-atom representation of macromolecules into a more simplified version consisting of centroids. The seminal ideas underlying this project were conceived in early 2007 while the PI was a post-doc at UCSD; they were carefully developed and refined over the next couple of years until he joined the RIT faculty in late 2008, and up to the present. The PI has access to various software and approx. 10 Tb of data storage at the department level, and to even more computing resources and memory space at RIT's I.T. Collaboratory and Research Computing Department.

Broader Impact. The PI teaches courses in proteomics and bioinformatics and has initiated a research-based teaching method in order to advance discovery while promoting teaching, training and learning. In this method, the students are given a "freestyle" laboratory exercise involving an open problem in the field. He will present the successful results of this innovative teaching technique in a video clip at ICERI 2010 (International Conference on Education, Research and Innovation). Our institute is the home of the National Technical Institute for the Deaf (NTID), and as such we have the unique opportunity to increase the participation of this underrepresented group in our research programs. Our institute also serves many talented minority and foreign students. Here we propose to create databases for reduced representations of proteins and nucleic acids (DCRR and TCRR, respectively) and to transform all current structures in the PDB (Protein Data Bank) and NDB (Nucleic Acid Database), thereby enhancing infrastructure for research and education. The PI, along with a post-doc, a technician and graduate students, plans to actively participate in national as well as international meetings in order to disseminate this work. These activities will be the prelude to the publication of the findings in primary, peer-reviewed scientific journals. Among the applications of this proposed work is the modeling and screening of ligand binding sites, which has significant implications for drug design - a societal benefit. A post-doc, a technician and undergraduate and graduate students will all be trained in the scientific method and way of thinking, so they can become independent scientific researchers in the future.
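To make the double-centroid idea in the summary above concrete, here is a minimal Python sketch; it is an illustration only, not the project's MATLAB or Fortran code. It assumes standard fixed-column PDB ATOM records, treats N, CA, C and O as the backbone, and treats every other atom of a residue as side chain (so a terminal OXT or any hydrogens would be counted as side chain). The function names `centroid` and `dcrr` are hypothetical.

```python
from collections import defaultdict

# Backbone atom names as listed in the proposal (N, C-alpha, C, O).
BACKBONE = {"N", "CA", "C", "O"}


def centroid(points):
    """Arithmetic mean of a non-empty list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def dcrr(pdb_lines):
    """Reduce ATOM records to two centroids per residue.

    Returns {(chain, resseq, resname): {"bb": xyz, "sc": xyz or None}}.
    Assumption: any atom not named in BACKBONE belongs to the side chain.
    """
    groups = defaultdict(lambda: {"bb": [], "sc": []})
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # skips HETATM, TER, REMARK, etc.
        name = line[12:16].strip()      # atom name, columns 13-16
        resname = line[17:20].strip()   # residue name, columns 18-20
        chain = line[21]                # chain ID, column 22
        resseq = int(line[22:26])       # residue number, columns 23-26
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        part = "bb" if name in BACKBONE else "sc"
        groups[(chain, resseq, resname)][part].append(xyz)
    return {
        key: {part: (centroid(pts) if pts else None) for part, pts in g.items()}
        for key, g in groups.items()
    }
```

Glycine has no side chain atoms, so its side chain centroid comes back as `None` in this sketch; how the DCRR Web Server itself treats glycine is not stated in the summary.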

TABLE OF CONTENTS For font size and page formatting specifications, see GPG section II.B.2.

Section: Total No. of Pages [Page No.* (Optional)*]

Cover Sheet for Proposal to the National Science Foundation
Project Summary (not to exceed 1 page): 1
Table of Contents: 1
Project Description (Including Results from Prior NSF Support) (not to exceed 15 pages) (Exceed only if allowed by a specific program announcement/solicitation or if approved in advance by the appropriate NSF Assistant Director or designee): 15
References Cited: 2
Biographical Sketches (Not to exceed 2 pages each): 2
Budget (Plus up to 3 pages of budget justification): 7
Current and Pending Support: 1
Facilities, Equipment and Other Resources: 1
Special Information/Other Supplementary Docs/Mentoring Plan: 1

Appendix (List below. ) (Include only if allowed by a specific program announcement/ solicitation or if approved in advance by the appropriate NSF Assistant Director or designee) Appendix Items:

*Proposers may select any numbering mechanism for the proposal. The entire proposal however, must be paginated. Complete both columns only if the proposal is numbered consecutively.


Use of a Reduced Protein Representation for the Modeling and Screening of Ligand Binding Sites: A Structure-Based Protein Function Prediction Method

Submitted by Vicente M. Reyes, Ph.D. [e-mail: [email protected]]
____________________________________________________________________________________
TABLE OF CONTENTS:

I. A. Introduction & Background.
   B. Statement of Work and Specific Objectives.
   C. General Plan and Design of Work.
   D. Relation of Work to Long-Term Goals of PI.
   E. Relation of Work to State of Knowledge in the Field.

II. Experimental Methods and Procedures.
   A. Converting AAR to DCRR.
   B. Calculation and Optimization of H-Bonds and VDW interactions for Display.
   C. The DCRR Web Server.
   D. The Tetrahedral Motif Data Model.
   E. Constructing the Tetrahedral LBS Motif.
   F. Screening for the Tetrahedral Motif.

III. Preliminary Results.
   A. Part (a.): The Protein Double-Centroid Reduced Representation (DCRR).
      1. Visualization of the Protein DCRR.
      2. The DCRR Web Server.
   B. Part (b.): The Tetrahedral Motif Model.
      1. Constructing the Model.
      2. Screening for the Model.
   C. Extension to Protein-Protein Interactions.

IV. Concluding Remarks.
   A. Conclusions and Future Directions.
   B. Preservation/Documentation/Sharing of Data & Related Research/Education Products.

V. Broader Impact.
VI. Management Plan.
VII. Miscellaneous Items.
   A. Keywords.
   B. Abbreviations.
   C. Definitions.
VIII. Appendix.
___________________________________________________________________________________

I. Introduction & Background. There exist different methods of protein 3D structure representation and visualization, the most popular being the all-atom representation (AAR), ribbon or 'spaghetti' representations, and space-filling models (see Sayle & Milner, 2000; DeLano, 2002; Guex et al., 1999; Schwede et al., 2003; Richardson & Richardson, 1992). AAR models, such as the wireframe and ball-and-stick models, display every atom of the protein. Although all chemical information of the component amino acid residues is accounted for, the display is crowded and overwhelming. Van der Waals (VDW) surface representations such as the space-filling model, on the other hand, are a good way to view the surface properties of the protein and to locate the shape complementarity involved in protein interactions, but they fail to clearly show secondary structures, loops, functional sites and non-covalent interactions. Finally, ribbon and spaghetti models and the like provide a good view of the secondary structures and loops but do not show any side chain structural elements.

Here we describe the "double-centroid reduced representation" (DCRR), a reduced protein representation wherein each amino acid residue in the protein is represented by two point coordinates: the centroid of the backbone atoms (N, Cα, C and O) and the centroid of the side chain atoms (Cβ and beyond). In AAR, each atom of the protein has a 3D coordinate; in the DCRR, there are only two coordinates per amino acid residue: that of the centroid of the backbone atoms, and that of the centroid of the side chain atoms. Typically the DCRR has about 76% less atomicity than the AAR for the same protein structure. DCRR is similar, but not identical, to the models proposed by Kolinski (2004) and Liwo et al. (1997), and was conceived independently of them. In these two models, the Cα position is used instead of the centroid of the backbone atoms; additionally, in the Liwo et al. (1997) method, a 'united peptide group' is inserted between two consecutive Cα atoms, to which the corresponding 'united sidechain group' is attached by a virtual bond.

We have further developed a graphical visualization tool, implemented in MATLAB, that displays the reduced representation of the input protein PDB file while simultaneously showing the intramolecular H-bonds and VDW interactions, as well as intermolecular ones with any bound ligands and water molecules.

Another of our aims was to develop a way of modeling ligand binding sites and of screening for these models in other proteins. We expected that this approach might find applications in ligand binding site modeling and screening that differ from, or even improve on, current methods (Guner et al., 2004; Guner, 2005; Hopfinger, 2000; Khedkar et al., 2007; Mason et al., 2001; Sun, 2008). An equally important goal of ours was to develop a mathematical and computational model for ligand binding sites (LBSs) that can be used for screening (i.e., detecting their presence in) protein structures. We thus proceeded to apply the DCRR method to the modeling of LBSs in proteins. Our LBS model is composed of the four most dominant amino acid centroids of the protein (in DCRR) that interact with the ligand atoms. These interactions may be in the form of hydrogen bonds or van der Waals interactions. The four centroids form a tetrahedron in 3D space; hence we term the model the 'tetrahedral motif' model.

Finally, we have developed a screening method for the tetrahedral motif (a.k.a. 3D search motif) in any given protein, in order to predict whether the given protein would bind the ligand whose binding site tetrahedral motif is being sought. The screening procedure is composed of a series of Fortran programs that take two inputs, namely a protein PDB structure file in DCRR, and the dimensions and centroid identities of the tetrahedral motif under query.
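The screening inputs described above can be illustrated with a short Python sketch. It is not the Fortran pipeline: it assumes that the "dimensions" of the motif are its six edge lengths, that a match means every edge length agrees within a tolerance `tol` (a hypothetical parameter), and that a centroid identity is a (residue name, "bb" or "sc") pair. The name `screen_motif` is also hypothetical.

```python
from itertools import product
from math import dist


def screen_motif(centroids, motif_labels, motif_edges, tol=1.0):
    """Search a DCRR structure for a four-centroid (tetrahedral) motif.

    centroids:    list of (label, (x, y, z)), label e.g. ("HIS", "sc")
    motif_labels: four labels, one per motif vertex
    motif_edges:  {(i, j): length} over vertex pairs, i < j in 0..3
    tol:          allowed deviation per edge (assumed matching criterion)

    Returns the indices of the first matching four centroids, or None.
    """
    # For each motif vertex, collect the centroids with the right identity.
    candidates = [
        [k for k, (label, _) in enumerate(centroids) if label == wanted]
        for wanted in motif_labels
    ]
    for combo in product(*candidates):
        if len(set(combo)) < 4:
            continue  # the four vertices must be distinct centroids
        if all(
            abs(dist(centroids[combo[i]][1], centroids[combo[j]][1]) - d) <= tol
            for (i, j), d in motif_edges.items()
        ):
            return combo
    return None
```

Matching on edge lengths alone cannot distinguish a tetrahedron from its mirror image; whether the actual procedure also checks handedness is not stated in this section.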
The programs then output either the coordinates of four centroids in the protein that closely match the tetrahedral motif, if such a match is found, or null if it is not.

B. Specific Objectives and Expected Significance.

The specific objectives of this proposal are the continuation and further development of the work described in the previous section. They are:
1. To continue to develop and refine our Protein DCRR Web Server (URL: http://tortellini.bioinformatics.rit.edu/vns4483/dcrr.php), ultimately to cover all structures in the PDB, with regular updates.
2. To construct 3D binding site tetrahedral motifs for other biologically important ligands, such as FMN, FAD, NAD/NADP, cholesterol, etc.
3. To extend the work to protein-protein interaction (PPI) interfaces, where there are two 3D binding site tetrahedral motifs, one on each interacting interface.
4. To screen novel proteins with known structures but unknown functions (whose numbers are increasing in the PDB) for the 3D binding site tetrahedral motifs found above.

5. To extend the above work to nucleic acid (DNA and RNA) structures, where the reduced representation will be the "triple-centroid reduced representation" (TCRR).

C. General Plan and Design of Work.

Work on all five specific objectives will proceed in parallel, since they have minimal or no dependencies on one another. The great majority of the calculation programs will be implemented in Fortran 77 or 90 and embedded in UNIX C-shell or Perl scripts. Thus far, the author has written all of the Fortran programs, and this is likely to continue until a proficient Fortran programmer joins the group. Flow diagrams that combine several Fortran programs, each performing a unit step in an entire algorithmic procedure, will be used extensively, as has been done in the past. To test a procedure, the scripts are first run on small datasets (10-12 elements), the results of which are then confirmed manually one by one. If this step is successful, medium-sized datasets (50-60 elements) are used, and about a quarter of the results are randomly selected and checked manually. If this step also succeeds, the full dataset (usually a subset of, or the entire, PDB) is used. While these steps are proceeding, the programs and scripts are continually tested, refined and optimized.

D. Relation of Work to Long-term Goals of PI.

This work fits well within the overall research program of the author, which is in the area of computational structural biology and revolves around novel methods of 3D representation and 3D structural analysis of proteins and other biomolecular structures (DNA, RNA, carbohydrates and lipids). This research area is also an area of strength of the author, since he has had several years of training in x-ray crystallography in addition to wet-bench molecular biology and biochemistry, as well as mathematics and computational science.

E. Relation of Work to State of Knowledge in the Field.
As far as novel representations of protein 3D structures are concerned, we can cite a number of works (Barlow & Richards, 1995; Li et al., 2003), although none deals specifically with reduced representation of proteins for the purpose of ligand binding site modeling and screening. The literature on that subject is very sparse, so a comprehensive review is not currently possible. In the broader subject of reduced representations of proteins, the work of H. Scheraga's research group on the UNRES (united residue) model of the protein polypeptide chain is notable; they mainly use it for molecular simulations using Langevin dynamics (Liwo et al., 1997). The work by Kolinski (2004) has also been mentioned earlier.

II. Experimental Methods and Procedures.

This report essentially consists of two parts, namely (a.) the implementation of the protein double-centroid reduced representation and the creation of a web server for it, and (b.) the development of a ligand binding site modeling method and of a screening procedure for such a model in any given protein 3D structure.

Part (a.): We have developed a new model of protein structure representation that strikes a balance between too much and too little information while capturing the chemical information of the side chains. We have implemented a visualization interface that displays the protein in a reduced representation along with its H-bond and van der Waals interactions. We call the representation the 'double-centroid reduced representation' (DCRR), as each amino acid is represented by two data points: the centroid of the backbone and the centroid of the side chain. DCRR reduces the atomicity of the protein while retaining just enough of the chemical information embodied in the side chains.
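The two-centroids-per-residue idea can be illustrated with a short sketch. It is written in Python purely for illustration (the author's programs are in Fortran and are not reproduced here), and the function names `centroid` and `to_dcrr` are hypothetical. It assumes standard fixed-column PDB ATOM records, uses unweighted coordinate means, and, as a simplifying assumption of this sketch, skips hydrogens and the terminal OXT atom.

```python
from collections import defaultdict

# Backbone atom names as written in PDB files (C' appears as "C").
BACKBONE = {"N", "CA", "C", "O"}


def centroid(points):
    """Unweighted mean of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def to_dcrr(pdb_lines):
    """Map (chain, residue number, residue name) to its two centroids.

    Each value is {"bb": xyz, "sc": xyz}; "sc" is None when a residue
    has no side chain atoms beyond CA (glycine).
    """
    groups = defaultdict(lambda: {"bb": [], "sc": []})
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        # Assumption of this sketch: hydrogens and terminal OXT are left out.
        if name == "OXT" or name.lstrip("0123456789").startswith("H"):
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        groups[key]["bb" if name in BACKBONE else "sc"].append(xyz)
    return {
        key: {part: centroid(pts) if pts else None for part, pts in g.items()}
        for key, g in groups.items()
    }
```

Calling `to_dcrr(open("protein.pdb"))` on a hypothetical file would return one backbone and (except for glycine) one side chain centroid per residue; alternate locations and insertion codes are not handled.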
We also developed a web server, the 'Protein DCRR Web Server', where users can enter a PDB ID or upload a model protein and obtain both the coordinates and the structure of the protein in DCRR.

A. Converting AAR to DCRR.

The protein coordinate file from the PDB (Berman et al., 2002) is converted from its all-atom representation to DCRR by calculating (1.) the centroid of the backbone atoms N, Cα, C' and O of each amino acid, and (2.) the centroid of the side chain atoms, Cβ and beyond, of each amino acid. No weights (such as atomic weights) were used in calculating the centroids; only the atomic positions (x,y,z) were considered.

B. Calculation and Optimization of H-bond and VDW Distances for Display.

Nearest-neighbor analysis is used to identify H-bonding and VDW interactions. A sphere, typically of radius 5.0 - 6.0 Å, is constructed around every atom in the protein as center; all other atoms found inside such a sphere are considered 'neighbors' of the central atom. Hydrogen bonds are taken to be those neighbor pairs whose distance falls within a close neighborhood of 2.80 Å (see below) and whose atoms have compatible chemical identities (those involving P, O, N and/or S). As for van der Waals interactions, we considered only those of the C-H····H-C type in which the distance between the carbon atoms is within a close neighborhood of 3.38 Å (see below).

We next determined the appropriate number of H-bonds and van der Waals interactions to display: not so many as to crowd the display, and not so few as to miss the important ones. We defined a window around the ideal H-bond and van der Waals distances, and varied the upper and lower limits of each until an optimal number of interactions for display was obtained. The ideal H-bond length is 2.80 Å (Jeffrey, 1997) and the ideal C-H····H-C van der Waals distance is 3.38 Å (Bondi, 1964; Kuzmin & Katzer, 2005; Nyburg & Faerman, 1985). After several trials, we designated a recommended range of 2.73 Å to 3.22 Å for H-bonds and 3.20 Å to 3.85 Å for VDW interactions. In the Web Server, the user has a choice of using the recommended limits above, or wide or narrow limits. The wide and narrow limits for H-bonds are 2.66 Å - 3.36 Å and 2.75 Å - 3.00 Å, respectively. The wide and narrow limits for VDW interactions are 3.10 Å - 3.95 Å and 3.30 Å - 3.75 Å, respectively.

Figure 1

C. The DCRR Web Server.

To make the DCRR method freely available to the scientific community, we have created the "Protein DCRR Web Server" at the Rochester Institute of Technology's Bioinformatics Division; its URL is http://tortellini.bioinformatics.rit.edu/vns4483/dcrr.php. The web server interfaces with a database that contains the DCRR coordinates of well over 50,000 protein structures from the PDB, as well as the MATLAB images of the proteins in DCRR, along with intermolecular and intramolecular H-bonds and VDW interactions.
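The distance windows quoted in section B above lend themselves to a small lookup table. The sketch below is illustrative Python, not the server's code; the names `WINDOWS` and `classify` are hypothetical, and treating any N/O/P/S pair as hydrogen-bond compatible is a simplifying assumption of this sketch.

```python
import math

# Distance windows in angstroms, as quoted in the text.
WINDOWS = {
    "recommended": {"hbond": (2.73, 3.22), "vdw": (3.20, 3.85)},
    "wide": {"hbond": (2.66, 3.36), "vdw": (3.10, 3.95)},
    "narrow": {"hbond": (2.75, 3.00), "vdw": (3.30, 3.75)},
}
HBOND_ELEMENTS = {"N", "O", "P", "S"}


def classify(elem_a, xyz_a, elem_b, xyz_b, limits="recommended"):
    """Return "hbond", "vdw" or None for one pair of neighboring atoms."""
    d = math.dist(xyz_a, xyz_b)
    lo, hi = WINDOWS[limits]["hbond"]
    if elem_a in HBOND_ELEMENTS and elem_b in HBOND_ELEMENTS and lo <= d <= hi:
        return "hbond"
    # C-H....H-C contacts are judged by the carbon-carbon distance.
    lo, hi = WINDOWS[limits]["vdw"]
    if elem_a == "C" and elem_b == "C" and lo <= d <= hi:
        return "vdw"
    return None
```

With the recommended limits, an N...O pair 2.9 Å apart is classified as an H-bond, whereas the same pair at 3.3 Å is accepted only under the wide limits.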
The image also shows any bound ligands and the ordered water molecules in x-ray crystallographic structures. In the future we plan to include NMR structures in our database. Figure 1 shows the DCRR web server home page.

Part (b.): We have developed a method for the mathematical and computational modeling of the binding site of any given ligand in a protein, based on the 'double-centroid reduced representation' (DCRR) of the protein. We designate the model the 'tetrahedral motif', as it is composed of four points in space. An algorithm has also been developed to search for this motif in any given protein 3D structure, thereby providing a novel way to predict the occurrence of LBSs in structures of new proteins.

Figure 2

D. The Tetrahedral Motif Data Model.

Our initial objective was to model the binding site of any given ligand using a reduced protein representation, so that a computationally economical and general screening procedure for the model could be developed. Since the protein DCRR coordinates as well as the H-bonds and van der Waals interactions have already been precomputed in our database, modeling a given ligand simply involves determining the four most dominant H-bonds and/or van der Waals interactions between protein and ligand atoms. The result is the 'tetrahedral motif' model for the ligand in question (see Figure 2). Ideally this procedure is performed on several proteins containing the same ligand, in order to arrive at a consensus motif.

E. Constructing the Tetrahedral LBS Motif.

We describe the procedure only briefly here, as a more comprehensive manuscript describing the application of the present method to LBS and pharmacophore modeling is being prepared for publication elsewhere (V.M. Reyes, in preparation). First, a training set consisting of protein 3D structures with the bound ligand of interest is selected from the PDB. For each training structure, the nearest protein atom neighbors of each ligand atom are determined using a nearest-neighbor Fortran program that finds all protein atoms within a sphere of radius ca. 6.0 Å around each ligand atom (Figure 3).

Figure 3

From these nearest neighbors, those judged to be H-bonds or -CH...HC- van der Waals interactions are further selected; selection is based on the atom identities of the neighbors and their distances from each other, and is implemented using a Fortran program. Then the four most dominant H-bond or VDW interactions are selected as the vertices of the tetrahedron.
One vertex (usually the most dominant) is arbitrarily designated as the "root", and the other three as "node1", "node2" and "node3" (R, n1, n2 and n3, respectively, for short). A "dominant" interaction is one that occurs most frequently in the training structures and/or has the most nearly ideal H-bonding or VDW distance between the protein-ligand atom neighbors involved. The validity of extracting such features from a set of heterogeneous proteins binding the same ligand is rooted in the work of Kobayashi and Go (1997a, 1997b), who showed that the LBSs for ATP have nearly identical or very similar architectures in a set of heterogeneous ATP-binding proteins; they showed a similar phenomenon for a set of heterogeneous GTP-binding proteins. Figure 4 shows a ligand bound in its binding site with the protein in AAR versus one in which the protein is in DCRR.
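As a rough illustration of the motif construction described above, the sketch below pairs a brute-force neighbor search with a frequency count across training structures. It is Python written for this illustration, not the author's Fortran; the names `neighbors_within` and `most_dominant` are hypothetical, and dominance is approximated by frequency alone (the text also allows closeness to the ideal distance as a criterion).

```python
import math
from collections import Counter


def neighbors_within(ligand_atoms, protein_atoms, radius=6.0):
    """Yield (ligand_atom, protein_atom, distance) for pairs within radius.

    Each atom is a (label, (x, y, z)) tuple; distances are in angstroms.
    This is a brute-force search over all ligand-protein atom pairs.
    """
    for lig in ligand_atoms:
        for prot in protein_atoms:
            d = math.dist(lig[1], prot[1])
            if d <= radius:
                yield lig, prot, d


def most_dominant(contacts_per_structure, k=4):
    """Rank centroid labels by the number of training structures they occur in.

    contacts_per_structure holds, for each training structure, the labels
    (e.g. "Fb", "Es") of the centroids judged to contact the ligand.
    """
    counts = Counter()
    for contacts in contacts_per_structure:
        counts.update(sorted(set(contacts)))  # count once per structure
    return [label for label, _ in counts.most_common(k)]
```

The first label returned would be the natural candidate for the root, and the next three for the nodes.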

Figure 4

F. Screening for the Tetrahedral Motif.

Once the tetrahedral motif (or, preferably, a consensus tetrahedral motif) is determined, the next logical step is to find out whether it also occurs in other proteins; if it does, the proteins in which it occurs are potential receptors for the ligand in question. If the proteins are functionally unannotated, this may be considered a way of assigning function to them, since knowing what ligand(s) a protein binds gives us a clue about its biological function. We developed a five-step search algorithm for the tetrahedral motif (or consensus motif). The steps are all implemented in either Fortran 77 or 90, and are illustrated and discussed in the next section.

III. Preliminary Results.

A. Part (a.): The Protein Double-Centroid Reduced Representation (DCRR).

1. Visualization of Protein DCRR. The image of an all-α protein, 1RFY, in DCRR with bound ordered water molecules as well as intra- and intermolecular H-bonds and VDW interactions is shown in Figure 5 A. The H-bonds are shown as blue dashed lines and the VDW interactions as red dashed lines. Black solid lines connect adjoining backbone centroids, while solid orange lines connect side chain centroids with their respective backbone centroids. Ligands are shown as bright green triangles and bound ordered water molecules as blue squares. Every amino acid is labeled with its single-letter code and color-coded according to its polarity (hydrophilicity). The image includes a legend on the right-hand side that allows easy identification of the amino acids and their interactions.

Figure 5 A

For comparison, the image of protein 292, an αβ protein, is shown in Figure 5 B. For further examples of DCRR structures, please see our Protein DCRR Web Server at the URL given earlier.

2. The DCRR Web Server. The image of the DCRR Web Server was shown in Figure 1; the server is available to the public at http://tortellini.bioinformatics.rit.edu/vns4483/dcrr.php. Users simply enter the PDB ID of the structure they wish to view in DCRR. If the protein is not deposited in the PDB, they may upload its structure coordinates by clicking the "Browse" button; they will then obtain a link containing the DCRR of the protein. Results will also be e-mailed to users who provide an e-mail address. We also believe that our DCRR visualization tool would be a useful pedagogical tool for both K-12 and college students.

Figure 5 B

B. Part (b.): The Tetrahedral Motif Model

1. Constructing the Model. The 'tetrahedral motif' model of a ligand binding site was shown in Figure 2. It is composed of four points, namely a unique root, R, and three different nodes, n1, n2 and n3. Each corresponds to an amino acid backbone or side chain centroid in the protein, and each has a set of (x,y,z) coordinates. These amino acids are in H-bonding and/or VDW interaction with ligand atoms. Also included in the data model are the lengths of the six sides of the tetrahedron, namely the three branches Rn1, Rn2 and Rn3, and the three node-edges n1n2, n2n3 and n1n3, all in Angstrom units (Å). Note that what we call 'branches' are root-to-node edges, while 'node-edges' are node-to-node edges. Thus the tetrahedral motif may be considered a data model that contains 14 parameters, of which eight are qualitative and six are quantitative. The eight qualitative parameters are the amino acid identities of the four centroids, in combination with their being backbone or side chain centroids (4 x 2 = 8), while the six quantitative parameters are the lengths of the six edges mentioned above. Our tetrahedral model is therefore information-rich and thus expected to be highly specific. Incidentally, in developing our data model, we also tried two other possibilities, namely: (a.) three points in space, or a 'plane triangular' model, and (b.) five points in space, or a 'pentahedral' model. The plane triangular model had low specificity and produced many false positives. The pentahedral model, on the other hand, was too computationally cumbersome. The tetrahedral motif proved to be the optimal model.

2. Screening for the Model.
The procedure for screening a protein 3D structure for the binding site tetrahedral motif of a given ligand is outlined in Table 1 and illustrated step-by-step in Figures 6-11. When used in screening, the tetrahedral motif is often called the '3D search motif' (3D SM).

Table 1. Algorithm for Screening a Protein 3D Structure for a Tetrahedral 3D Search Motif

Step #   Description
0        Start with protein 3D structure in DCRR and the 3D search motif
1        Sequester amino acid residues in protein which are in 3D search motif
2        Select backbone or side chain centroids according to 3D search motif
3        Calculate distances and select those within limits of sides of 3D search motif
4        Select roots associated with three nodes as specified in 3D search motif
5        Select node-edges with lengths within limits of those in 3D search motif
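The steps in Table 1 can be sketched compactly. What follows is an illustrative Python rendering under stated assumptions, not the author's Fortran programs: the `Motif` class, the `screen` function, the centroid labels (such as "Fb" for a Phe backbone centroid) and the tolerance parameter `eps` are all hypothetical names chosen for this sketch, and the protein is assumed to be already in DCRR as a list of labeled centroids.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Motif:
    root: str          # e.g. "Fb" (Phe backbone centroid)
    nodes: tuple       # three labels, e.g. ("Es", "Ds", "As")
    branches: tuple    # lengths R-n1, R-n2, R-n3 in angstroms
    node_edges: tuple  # lengths n1-n2, n2-n3, n1-n3 in angstroms


def screen(centroids, motif, eps=1.0):
    """Return every (root, n1, n2, n3) in the protein that matches the motif.

    centroids is a list of (label, residue_id, (x, y, z)) tuples (step 0).
    """
    wanted = {motif.root, *motif.nodes}
    # Steps 1-2: keep only centroids of the residue types and kinds
    # (backbone or side chain) that occur in the motif.
    pool = [c for c in centroids if c[0] in wanted]
    hits = []
    for root in (c for c in pool if c[0] == motif.root):
        # Step 3: per node, keep centroids whose distance to this root
        # lies within eps of the corresponding branch length.
        per_node = [
            [c for c in pool
             if c[0] == label and c != root
             and abs(math.dist(root[2], c[2]) - length) <= eps]
            for label, length in zip(motif.nodes, motif.branches)
        ]
        # Step 4: a root qualifies only if all three nodes are present;
        # product() yields nothing when any candidate list is empty.
        for n1, n2, n3 in product(*per_node):
            if len({n1, n2, n3}) < 3:
                continue  # one centroid cannot fill two vertices
            # Step 5: node-to-node edges must also match within eps.
            edges = (math.dist(n1[2], n2[2]),
                     math.dist(n2[2], n3[2]),
                     math.dist(n1[2], n3[2]))
            if all(abs(e - ref) <= eps
                   for e, ref in zip(edges, motif.node_edges)):
                hits.append((root, n1, n2, n3))
    return hits
```

An empty list plays the role of the "null" output; each returned tuple is a candidate binding site. Filtering by label first keeps the combinatorial step small, which is presumably the point of ordering the steps this way.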

The screening procedure is written as a series of Fortran programs that take two inputs, namely a protein PDB structure file in DCRR, and the dimensions and centroid identities of the tetrahedral motif under query (Figure 6). We first sequester the amino acid residues in the query protein that are found in the 3D SM; in the example, the 3D SM contains the vertices Fb (Phe backbone centroid), Es (Glu side chain centroid), Ds (Asp side chain centroid) and As (Ala side chain centroid), so we sequester all F, E, D and A residues from the query protein (Figure 7, step #1). Then, from this sequestered set of residues, the appropriate backbone or side chain centroids are selected; in the example, the F side chain centroids are discarded, and so are the E, D and A backbone centroids, retaining only the Fb, Es, Ds and As centroids, which are precisely what the 3D SM contains (Figure 8, step #2); this reduces the size of the sequestered group by half. Next, the distances between centroids in the sequestered group are calculated, and only those falling within the limits of the corresponding branches in the 3D SM are retained (Figure 9, step #3). For example, if the Fb-Es branch is 8.80 Å, then only Fb-Es lengths in the sequestered group falling within 8.80 Å ± ε are retained; ε is the fuzzy margin, and we usually set it at 1.0 to 1.5 Å. Then, roots that are associated with exactly the three nodes in the 3D SM are chosen, further reducing the size of the sequestered group. We call each such combination of one root and three nodes a 'group', and each is a candidate LBS in the query protein (Figure 10, step #4). Finally, the lengths of the node-edges in each group are computed, and those falling within the limits of the corresponding edges in the 3D SM are chosen (Figure 11, step #5); we use a similar fuzzy margin ε of around 1.0 to 1.5 Å here. What is left at this point are the potential LBS(s) in the query protein, as they have parameters similar to those of the 3D SM. The steps above were all coded as Fortran 77 or 90 programs, which are currently in the process of being published in a biological program source code repository journal/database.

Figure 6

The above approach of modeling and then screening for the binding site of any given ligand whose 3D structure in complex with its cognate receptor protein is known, using a reduced protein representation, is novel. It may be applied not only to LBS modeling and screening but also to structure-based protein function assignment, since the number of functionally unannotated protein structures in the PDB has grown significantly owing to the many structural genomics studies (Levitt, 2004) currently in operation. We have applied the above procedure to the modeling of the ATP binding site in the ser/thr protein kinase family as well as the GTP binding site in the small, Ras-like G-protein family. We give only a summary of the results here, as a more comprehensive manuscript describing the application of the present method to LBS and pharmacophore modeling is being prepared for publication elsewhere (V. M. Reyes, in preparation). Briefly, a training set of ATP-binding proteins composed of structures 1B38, 1B39, 1FIN, 1GOL, 1HCK, 1JST, 1PHK, 1QL6, 1QMZ and 2PHK, and a training set of GTP-binding proteins composed of 1E96, 1N6L, 1NVU, 1LOO, 1M7B, 1O3Y and 2RAP, were used. Tetrahedral LBS motif models were then built (as described in the Methods section) for each protein family.
The performance of each model in screening for ATP and GTP LBSs was validated using a set of 15 'unseen' positive control structures for each family; a set of 30 negative controls was used for both families. The screening algorithm yielded a sensitivity of ~60% and a success rate of ~87% for the ATP-binding family, while it yielded a sensitivity of ~93% and a success rate of ~97% for the GTP-binding family; both have specificities close to 100%. Thus the ATP and GTP 3D SMs built from their respective training sets may be considered robust.

Figure 7


Using the models, ~800 solved protein structures in the PDB without functional annotation were screened for the ATP- and GTP-binding site models, thus assigning potential functions to these structures. The results of this study will be published in a separate submission (ibid.).

C. Extension to Protein-Protein Interactions.

We have also made considerable progress towards extending the above procedure to the detection/prediction of protein-protein interactions (PPI). Our procedure for predicting PPI partners is composed of 4 steps (see Appendix, Figure S2), namely: (1.) Determination of Monomer Interface Interactions; (2.) Determination of the Interface Tetrahedral Search Motif Pair; (3.) Screening the Test Set for the Tetrahedral Search Motif Pair; and (4.) Least-squares superposition of the tetrahedral SMs found in the test structures onto those of the tetrahedral motif pair in the model, a step tantamount to docking the two predicted PPI partners together into a binary complex. The procedure starts with the determination of the interactions at the interface of an experimentally solved training structure for the complex under study. We focused our attention on the protein-protein interface because it is well established that it possesses certain special properties, most of which are conserved. Then a tetrahedral 3D SM is constructed from the interface of each monomer, giving rise to a docked pair of 3D SMs (see Figures 12A and 12B). The set of proteins to be annotated (the "test or application set") is then screened for each 3D SM, and those testing positive for either motif are deemed candidate PPI partners. Some preliminary results are shown in the Appendix (Tables S1 and S2).

Figure 8

IV. Concluding Remarks.

A. Conclusions and Future Directions.

The idea for DCRR came from the need for a simplified yet chemically meaningful protein structure representation.
For example, using the all-atom representation for molecular dynamics and protein-protein interaction studies is computationally uneconomical because of the very large memory required to manipulate and analyze the sheer number of data points in a protein. The double-centroid reduced representation of proteins is well suited to these types of work: it drastically simplifies the protein structure information, as each amino acid is represented by two data points, the centroid of the backbone and the centroid of the side chain, reducing the overall number of data points by as much as 75%. Since DCRR contains both backbone and side chain information, the essential biochemical information is captured, unlike in the Cα-trace method, where all side chain information is lost.
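The size of the reduction quoted above follows from simple arithmetic, sketched here in Python. The function name and the example protein are hypothetical, and the sketch assumes that glycine, having no atoms at Cβ or beyond, contributes only a backbone centroid.

```python
def atomicity_reduction(n_atoms, n_residues, n_glycines=0):
    """Fraction of coordinates removed when going from AAR to DCRR."""
    # Two centroids per residue, except glycine (backbone centroid only).
    n_centroids = 2 * n_residues - n_glycines
    return 1.0 - n_centroids / n_atoms


# Hypothetical protein: 100 residues, 10 of them glycine,
# 800 non-hydrogen atoms in the all-atom representation.
reduction = atomicity_reduction(n_atoms=800, n_residues=100, n_glycines=10)
print(f"{reduction:.0%}")  # prints 76%
```

The exact figure for a real structure depends on its amino acid composition and on whether hydrogens are present in the coordinate file.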

The MATLAB visualization tool has been used to visualize the protein DCRR. MATLAB is an excellent visualization interface that allows users to rotate, zoom in, zoom out and translate the protein for a better 3D view. Our DCRR visualization tool allows simultaneous display of the secondary structures of the protein as well as the H-bonds and the van der Waals interactions. It also shows any ligand(s) bound to the protein as well as bound ordered water molecules. Our DCRR tool provides a good view of the ligand binding site as well as the protein-water and protein-ligand interactions. The visualization script has been programmed to include a legend for easy identification of amino acids and ligands. Unlike other protein visualization tools, our DCRR tool allows easy identification of the amino acid residues at each site, as each side chain centroid is labeled and color-coded according to its polarity and hydrophobicity. We also think it is more user-friendly than other visualization tools in which the user needs to click on each point to identify the amino acid. Another advantage of using MATLAB for visualization is that it is freely available in most if not all academic and research institutions, and all undergraduate science and engineering students are proficient in it, as it is also used in mathematics, statistics and engineering courses. Moreover, most if not all colleges and universities offer free MATLAB tutorials to all of their entering students precisely for this purpose.

Figure 9

Our Protein DCRR Web Server contains a database of the precomputed DCRR coordinates of most of the x-ray crystallographic structures currently deposited in the PDB, save for a few thousand structures containing segment breaks, which our wrapper script cannot currently handle; we shall rectify this minor "bug" in the near future.
Our DCRR web server also currently does not contain NMR structures, but we plan to include them in future versions of our software and web server.

Figure 10

Finally, using the DCRR method, we have also developed and implemented a method to mathematically and computationally model the binding site of a given ligand, and then screen any protein 3D structure for the presence of that LBS model. We term the model the 'tetrahedral motif', as it is composed of four points corresponding to backbone or side chain centroids of the amino acids contacting the ligand at the ligand binding site. This combined modeling and screening approach is novel in the sense that it employs a reduced protein representation, namely the 'double-centroid reduced representation' or DCRR. The entire algorithm is written in Fortran 77/90 code and run on a UNIX platform, and is thus fast and especially amenable to batch, high-throughput implementations. It is of note that our screening method has a specificity close to 100% for both test families, with a sensitivity of ~60% for the ATP-binding family and ~93% for the GTP-binding family (see partial ROC curves in Appendix, Figure S1). The set of programs is currently in the process of being deposited in a journal/repository for biologically applicable source codes. Future directions of the present research include (a.) application of the method (modeling and screening) to biologically important ligands other than ATP and GTP, (b.) use of not one but two tetrahedral motifs for large ligands (e.g., NAD/NADP+), and (c.) extension of the method (applied here to protein-ligand interactions) to protein-protein interactions (in progress).

Figure 11

Figure 12 A

B. Preservation/Documentation/Sharing of Data & Related Research/Education Products.

Results of our work on this project will first be presented as posters at scientific conferences and, after further development and refinement, will be written up as manuscripts for publication in peer-reviewed scientific journals. They will be made available to the community in the form of databases and web servers. Unpublished results and other details will be available upon request from the PI; those deemed patentable by our institute's intellectual property management office will be submitted for patent and made available to academic or non-profit requestors upon formal agreement of confidentiality. This project is especially useful pedagogically, since it is a new way to look at and analyze protein 3D structures.

V. Broader Impact.

The author teaches courses in proteomics and bioinformatics and has always thought about how best to integrate basic research into classroom teaching. At RIT he has initiated a research-based teaching method in order to advance discovery while promoting teaching, training and learning. In this method, the students are given a "freestyle" laboratory exercise involving an open problem in the field. He will present the successful results of this innovative teaching technique in a video clip at ICERI 2010 (International Conference on Education, Research and Innovation). The author will continue to improve and refine his research-based pedagogical method as he teaches a growing variety of courses at the institute.

As for broadening the participation of underrepresented groups, we note that our institute has been the home of the National Technical Institute for the Deaf (NTID) since 1967. As such, we have a unique opportunity to increase the participation of this group in our research programs. Our institute also attracts and serves many talented African- and Latino-American students, as well as foreign nationals; for example, we attract many talented students from Malaysia and India.

As a means of enhancing infrastructure for basic research and education, we propose to create databases of reduced representations of proteins and nucleic acids (DCRR and TCRR) by converting all current structures in the PDB (http://www.pdb.org/pdb/home/home.do) and NDB (http://ndbserver.rutgers.edu/). It is hoped that these two novel resources will be useful not only to researchers in the computational biology field but also to K-12 students and educators.

Figure 12 B

The author will strongly encourage his graduate students and post-doc to give research talks in one of the several seminar series in place within the institute as well as within the Rochester-Buffalo-Ithaca area. The author, along with his post-doc, technician and graduate students, plans to participate actively in national as well as international meetings in order to disseminate their work. These activities will be the prelude to the publication of their findings in primary, peer-reviewed scientific journals. Among the applications of this proposed work is the modeling of and screening for ligand binding sites, which has significant implications for drug design. The datasets currently being used by the author and his students are proteins of human origin, and it is hoped that this human focus will benefit society in general.
Finally, the author intends to make sure that his post-doc, technician, and undergraduate and graduate students are all trained in the scientific method and way of thinking, so that they can become the independent scientific researchers of the future.

VI. Management Plan.

In addition to regular e-mail communications, the author will hold 1- to 2-hour meetings every other week with his graduate students, post-doc and technician for research progress reports and other pertinent matters. In these meetings, everyone, especially the author, will brainstorm to find solutions to any problems anyone is having with his/her research, and to present new ideas for approaching the research problems at hand. These meetings will also be a forum where anyone can pose new scientific questions (within the realm of the author's scientific program) for possible analysis and proposed solution by everyone in the group. Occasional lectures by the author on special topics relevant to his research program will also be held during these meetings. These meetings will ultimately serve as a training ground for the students, post-doc and technician in the scientific method and way of thinking. Financial and administrative matters related to the grant will be handled in coordination with the institute's Office of Sponsored Research.

VII. Miscellaneous Items.

A. Keywords: protein reduced representation; ligand binding site; pharmacophore modeling; protein structure visualization; double-centroid representation.

B. Abbreviations: AAR, all-atom representation; DCRR, double-centroid reduced representation; H-bond, hydrogen bond; VDW, van der Waals; LBS, ligand binding site; BS, binding site; PDB, Protein Data Bank; TM, tetrahedral motif.

C. Definitions:

3D Motif: A specific local 3D arrangement of specific protein atoms (from the backbone or side chains) created when they are brought close together in space by protein folding; the residues involved may or may not be contiguous in the primary sequence.

3D Search Motif: A 3D motif that is encoded in a computer program and used with a search algorithm to screen structures, usually of proteins; in the present context, this is the tetrahedral motif corresponding to an LBS.

Centroid: In the sense used in this article, the unweighted geometric centroid of a group of neighboring atoms, considering only their x-, y-, and z-coordinates and not their atomic masses.

All-Atom Representation: The usual representation of protein 3D structures, as in the PDB, where each atom (usually the non-hydrogen ones) has its own coordinates, usually Cartesian (x, y, z).

Double-Centroid Representation: A reduced 3D representation of a protein wherein each amino acid is represented by two centroids: that of the backbone atoms (N, Cα, C, O) and that of the side chain atoms (Cβ and beyond).

Ligand Binding Site (LBS): The specific site on a protein, usually a crevice or a pocket of varying depth, where a ligand binds in a specific geometry and orientation, and with high specificity.

Pharmacophore: A subset of the 3D structural features of a ligand that are specifically recognized at its binding site in its cognate protein receptor molecule and are essential for its biological action(s).

Pharmacophore Modeling: The extraction of the geometric and electrostatic (i.e., chemical) properties of a ligand that are essential for its biological function, preferably in the form of a specific data structure for computational input; an important step in LBS screening and drug design.

Reduced Representation: A method of representing macromolecules, usually with a visual component, where the atomicity (number of coordinates) is significantly reduced compared to the all-atom representation (usually derived from an x-ray crystallographic model).

Tetrahedral Motif: The reduced representation of a particular LBS, composed of four points that are centroids of the backbone or side chain atoms in the protein contacting the ligand via H-bonding or VDW interaction at its LBS; one vertex is designated the 'root' and the other three are 'nodes' 1, 2 and 3.

VIII. Appendix.
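The centroid and double-centroid definitions in the glossary can be illustrated with a minimal Python sketch. It is not the proposal's own implementation: the function names, the dictionary layout, the choice to skip the side-chain centroid for glycine, and the example coordinates are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]

# PDB-style names of the backbone atoms (N, C-alpha, C, O); any other
# atom in the residue is treated as a side-chain atom (C-beta and beyond).
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def centroid(points: List[Coord]) -> Coord:
    """Unweighted geometric centroid: mean of x, y and z, ignoring atomic masses."""
    n = len(points)
    if n == 0:
        raise ValueError("centroid of an empty atom group is undefined")
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def double_centroid(residue: Dict[str, Coord]) -> Dict[str, Coord]:
    """Reduce one residue (atom name -> coordinates) to at most two centroids."""
    backbone = [xyz for name, xyz in residue.items() if name in BACKBONE_ATOMS]
    side_chain = [xyz for name, xyz in residue.items() if name not in BACKBONE_ATOMS]
    reduced = {"backbone": centroid(backbone)}
    if side_chain:  # assumption: residues with no side-chain heavy atoms get one centroid
        reduced["side_chain"] = centroid(side_chain)
    return reduced


# Made-up coordinates for an alanine-like residue (illustration only).
example = {
    "N": (0.0, 0.0, 0.0),
    "CA": (1.0, 0.0, 0.0),
    "C": (2.0, 1.0, 0.0),
    "O": (1.0, 3.0, 0.0),
    "CB": (1.0, -1.0, 1.0),
}
reduced_example = double_centroid(example)
```

For the example residue the backbone centroid is (1.0, 1.0, 0.0) and the side-chain centroid coincides with the single Cβ atom.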

Figure S1

To estimate the sensitivity and specificity of the algorithm, positive and negative controls were performed and partial ROC curves were constructed. Fifteen positive control structures for the ATP-binding ser/thr protein kinase family and another 15 positive control structures for the GTP-binding small, Ras-type G-protein family were selected from the PDB. Another 30 negative control structures were chosen and used for both families. The partial ROC curves for both families are shown. For the ATP-binding family, our algorithm showed a sensitivity of ~60%, a specificity of nearly 100%, a success rate of ~86%, and a Matthews correlation coefficient of ~70%. For the GTP-binding family, our algorithm had a sensitivity of ~93%, a specificity of nearly 100%, a success rate of ~97%, and a Matthews correlation coefficient of ~95%.
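As a hedged illustration of how the four statistics relate, the Python sketch below computes them from a 2x2 confusion matrix. The caption does not give the underlying counts; the values used here (9 and 14 true positives out of 15, no false positives among 30 negatives) are assumptions chosen only because they are consistent with the reported percentages.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity, success rate (accuracy) and Matthews correlation."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    success_rate = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "success_rate": success_rate,
        "mcc": mcc,
    }


# Assumed counts (not stated in the caption): 15 positives and 30 negatives per family.
atp = binary_metrics(tp=9, fn=6, tn=30, fp=0)
gtp = binary_metrics(tp=14, fn=1, tn=30, fp=0)

for family, stats in (("ATP", atp), ("GTP", gtp)):
    print(family, {name: round(value, 3) for name, value in stats.items()})
```

With these assumed counts the ATP family gives a sensitivity of 0.600, a success rate of 0.867 and an MCC of 0.707; the GTP family gives 0.933, 0.978 and 0.950.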


Figure S2

A schematic diagram of the procedure for predicting PPI partners is shown. An experimental 3D structure of the binary complex under study is required, because the interface 3D search motif pair is derived from it. The interface 3D search motif is composed of two "docked" 3D search motifs (SMs), one from each protomer of the binary complex, designated the "A interface motif" and the "B interface motif". Using our screening algorithm, the test structures are screened twice: once for the A interface motif and once for the B interface motif. Structures testing positive for the former and those testing positive for the latter are candidate (putative) PPI partners, which will then be "docked" together using least-squares superposition methods.
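The two-pass screening in this caption can be sketched in Python as follows. The screening algorithm itself is not reproduced: `has_a_motif` and `has_b_motif` are hypothetical stand-ins for it, the structure identifiers are invented, and the docking step is omitted.

```python
from typing import Callable, Iterable, List, Tuple


def candidate_partner_pairs(
    structures: Iterable[str],
    has_a_motif: Callable[[str], bool],
    has_b_motif: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Screen every structure twice and pair each A-positive with each B-positive."""
    structures = list(structures)
    a_hits = [s for s in structures if has_a_motif(s)]  # first pass: A interface motif
    b_hits = [s for s in structures if has_b_motif(s)]  # second pass: B interface motif
    # A structure positive for both motifs pairs with itself (a homodimer-like candidate).
    return [(a, b) for a in a_hits for b in b_hits]


# Toy data: which (invented) structures test positive for each motif.
a_positive = {"s1", "s3"}
b_positive = {"s2", "s3"}
pairs = candidate_partner_pairs(
    ["s1", "s2", "s3", "s4"],
    lambda s: s in a_positive,
    lambda s: s in b_positive,
)
```

Each returned pair would then go on to the least-squares superposition ("docking") step described in the caption.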

Figure S1


Figure S1 (cont'd.)

The 9 training structures (1 for each complex) are shown in Table S1; they are: (1) 1c1y: RAP-Gmppnp/c-RAF1 Ras-binding protein; (2) 1cxz: RHOA/Protein Kinase (PKN/PRK1) effector domain; (3) 1ds6: RAC/RHOGD1; (4) 1e96: RAC/P67PHOX; (5) 1fq1: kinase-associated phosphatase (KAP)/phosphoCDK2; (6) 1fc2: immunoglobulin Fc fragment/protein A fragment B; (7) 1mco: immunoglobulin light chain dimer (Bence-Jones protein); (8) 1jdh: beta-catenin/HTCF-4; and (9) 3ink: interleukin-2 homodimer. They are designated as complexes A, B, C, D, H, I, P, Q and Z, respectively.

Table S2

Some 801 structures in the PDB without known functions were used as the test set. These 801 experimentally solved test structures were screened for each tetrahedron using the screening algorithm developed earlier for protein-ligand interactions. The overall results of screening the 801 application structures are shown in Table S2, which lists the number of test structures testing positive for the monomers in the training structures, as well as the percentage of structures in the test set testing positive for each binary complex.


REFERENCES:

Barlow, T. W., & Richards, W. G. (1995). A Novel Representation of Protein Structure. J Molec Graphics, 13, 373-376.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Res, 28(1), 235-242.

Bondi, A. (1964). van der Waals volumes and radii. J Phys Chem, 68(3), 441-451.

Guex, N., Diemand, A., & Peitsch, M. C. (1999). Protein modelling for all. Trends Biochem Sci, 24(9), 364-367.

Güner, O., Clement, O., & Kurogi, Y. (2004). Pharmacophore modeling and three dimensional database searching for drug design using catalyst: recent advances. Curr Med Chem, 11(22), 2991-3005.

Guner, O. F. (2005). The impact of pharmacophore modeling in drug design. IDrugs, 8(7), 567-72.

Hopfinger, A. J., & Duca, J. S. (2000). Extraction of pharmacophore information from high-throughput screens. Curr Opin Biotechnol, 11(1), 97-103.

Jeffrey, G. A. (1997). An Introduction to Hydrogen Bonding. Oxford Univ. Press, Pittsburg, PA, U.S.A.

Khedkar, S. A., Malde, A. K., Coutinho, E. C., & Srivastava, S. (2007). Pharmacophore modeling in drug discovery and development: an overview. Med Chem, 3(2), 187-97.

Kobayashi, N., & Go, N. (1997). A method to search for similar protein local structures at ligand binding sites and its application to adenine recognition. Eur Biophys J, 26(2), 135-44.

Kobayashi, N., & Go, N. (1997). ATP binding proteins with different folds share a common ATP-binding structural motif. Nature Struct Biol, 4(1), 6-7.

Kolinski, A. (2004). Protein modeling and structure prediction with a reduced representation. Acta Biochim Polonica, 51(2), 349-371.

Kuzmin, V. S., & Katser, S. B. (2005). Calculations of van der Waals volumes of Organic Molecules. Russian Chem Bull, 41(4), 720-727.

Levitt, M. (2007). Growth of Novel Protein Structural Data. Proc Nat Acad Sci, 104(9), 3183-3188.

Li, X., Hu, C., & Liang, J. (2003). Simplicial Edge Representation of Protein Structures and Alpha Contact Potential with Confidence Measure. Proteins: Structure, Function and Genetics, 53, 792-805.

Liwo, A., Oldziej, S., Pincus, M. R., Wawak, R. J., Rackovsky, S., & Scheraga, H. A. (1997). A united-residue force field for off-lattice protein-structure simulations. I. Functional forms and parameters of long-range side-chain interaction potentials from protein crystal data. J Comput Chem, 18(7), 849-873.

Mason, J. S., Good, A. C., & Martin, E. J. (2001). 3-D Pharmacophores in drug discovery. Curr Pharm Des, 7(7), 567-97.

Nyburg, S. C., & Faerman, C. H. (1985). A revision of van der Waals atomic radii for molecular crystals: N, O, F, S, Cl, Se, Br and I bonded to carbon. Acta Crystallogr B, 41(4), 274-279.

Reyes, V.M.*, & Sheth, V.N. Visualization of Protein 3D Structures in 'Double-Centroid' Reduced Representation: Application to Ligand Binding Site Modeling and Screening. Handbook of Research in Computational and Systems Biology: Interdisciplinary Approaches, IGI-Global/Springer (*corresponding author; in press).

Richardson, D. C., & Richardson, J. S. (1992). The kinemage: a tool for scientific communication. Protein Sci, 1(1), 3-9.

Sayle, R. A., & Milner, J. E. (2000). Rasmol: Biomolecular graphics for all. Trends in Biochem Sci, 20(9), 374-376.

Schwede, T., Kopp, J., Guex, N., & Peitsch, M. C. (2003). SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res, 31(13), 3381-3385.

Sun, H. (2008). Pharmacophore-based virtual screening. Curr Med Chem, 15(10), 1018-24.

VICENTE M. REYES, Ph.D.
Dept. Biological Sciences, Sch. Biol. & Med. Sciences
College of Science, Gosnell 08-1336
Rochester Institute of Technology
Rochester, NY 14623-5603
Tel: (585) 475-4115 Cell: (619) 212-9131
E-mail: [email protected] [email protected]

(a) Professional Preparation:

• Univ. of the Philippines, Diliman, Philippines (conc. in pure mathematics & operations research), Mathematics, B.S. (magna cum laude), 1980
• Univ. of the Philippines, Diliman, Philippines (conc. in organic chemistry & biochemistry), Chemistry, B.S. (magna cum laude), 1980
• California Institute of Technology, Pasadena, California, USA (conc. in molecular biology & biochemistry), Chemistry, Ph.D., 1988
• UCSD School of Extended Studies, Spec. Cert. in Bioinformatics (Spring 2002)
• UCSD School of Extended Studies, Prof. Cert. in Bioinformatics (Spring 2004)
• UCSD School of Extended Studies, Spec. Cert. in Data Mining (Winter 2007)

(b) Appointments:

• Assistant Professor, Dept. of Biological Sciences, SBMS, COS, R.I.T., 9/2008-present (computational biology/bioinformatics)
• IRACDA Postdoctoral Fellow & Assistant Project Scientist, UCSD Dept. of Pharmacology, SOM, 2004-'08 (computational biology/structural bioinformatics)
• Structural Bioinformatics Researcher, San Diego Supercomputer Center, 2002-'04 (structural bioinformatics)
• Bioinformatics studies, UCSD School of Extended Studies, La Jolla, CA, 2000-'02 (general bioinformatics)
• Senior Research Associate, The Scripps Research Institute, La Jolla, CA, 1995-'00 (protein x-ray crystallography; structure-based drug design)
• Postdoctoral Biochemist, Dept. of Chem. & Biochem., UCSD, La Jolla, CA, 1992-'95 (protein x-ray crystallography; structural enzymology)
• Postdoctoral Biologist, Dept's. of Biol. & Med., UCSD, La Jolla, CA, 1990-'92 (HIV/AIDS molecular biology)
• Postdoctoral Research Fellow, Lab. Tumor Cell Biol., NCI/NIH, Bethesda, MD, 1988-'89 (HIV/AIDS molecular biology)
• Graduate Student & Teaching Assistant, Dept. of Biol., CIT, Pasadena, CA, 1983-'88 (gene expression/molecular biology)
• Instructor in Mathematics, Dept. of Math., Univ. of the Phils., Diliman, Phils., 1980-'82 (differential and integral calculus I, II, and III; probability & statistics)

(c) Publications:

(i) Most closely related to proposed project:

• Reyes, V.M., "Representation of Protein 3D Structures in Spherical (ρ,φ,θ) Coordinates and Two of Its Potential Applications" (accepted for publication, iCCSB 2010).
• Reyes, V.M.* & Sheth, V.N., "Visualization of Protein 3D Structures in 'Double-Centroid' Reduced Representation: Application to Ligand Binding Site Modeling and Screening", Handbook of Research in Computational and Systems Biology: Interdisciplinary Approaches, IGI-Global/Springer (*corresponding author; in press).
• Reyes, V.M., "Modeling Protein-Protein Interface Interactions as a Means for Predicting Protein-Protein Interaction Partners." J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 873
• Reyes, V.M., "Pharmacophore Modeling Using a Reduced Protein Representation as a Tool for Structure-Based Protein Function Prediction", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 873
• Reyes, V.M., "Representing Protein 3D Structures in Spherical Coordinates - Two Applications: 1. Detection of Invaginations, Protrusions, and Potential Ligand Binding Sites; and 2. Separation of Protein Hydrophilic Outer Layer from the Hydrophobic Core", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, pp. 874-5

(ii) Other Significant Publications (after 1990):

• Reyes, V.M., "Pharmacophore Modeling Using a Reduced Protein Representation: Application to the Prediction of ATP, GTP, Sialic Acid, Retinoic Acid, and Heme-Bound and -Unbound Nitric Oxide Binding Proteins", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 874
• Li, W., Byrnes, R.W., Hayes, J., Birnbaum, A., Reyes, V.M., Shahab, A., Mosley, C., Pekurovsky, D., Quinn, G.B., Shindyalov, I.N., Casanova, H., Ang, L., Berman, F., Arzberger, P.W., Miller, M., Bourne, P.E. "The Encyclopedia of Life Project: Grid Software and Deployment." New Gener. Comp. (2004) 22:127-136.
• Reyes, V.M., Greasley, S.E., Stura, E.A., Beardsley, G.P., Wilson, I.A. "Crystallization and preliminary crystallographic investigations of avian 5-aminoimidazole-4-carboxamide ribonucleotide transformylase-inosine monophosphate cyclohydrolase expressed in Escherichia coli." Acta Crystallogr D Biol Crystallogr. (2000) Aug;56 (Pt 8):1051-4.
• Lee, H., *Reyes, V.M., Kraut, J. "Crystal structures of Escherichia coli dihydrofolate reductase complexed with 5-formyltetrahydrofolate (folinic acid) in two space groups: evidence for enolization of pteridine O4." Biochemistry. (1996) Jun 4;35(22):7012-20. (*corresponding author)
• Reyes, V.M., Sawaya, M.R., Brown, K.A., Kraut, J. "Isomorphous crystal structures of Escherichia coli dihydrofolate reductase complexed with folate, 5-deazafolate, and 5,10-dideazatetrahydrofolate: mechanistic implications." Biochemistry. (1995) Feb 28;34(8):2710-23.

(d) Synergistic Activities:

• Member, External Faculty, Ph.D. Program of the Golisano Institute of Computing and Information Sciences, R.I.T. (Prof. P.-C. Shi, Dept. of Computer Science, director)
• Member, Center for Applied and Computational Mathematics, Dept. of Mathematics, R.I.T. (Prof. A. Harkin, Dept. of Mathematics, director)
• Rochester Inst. of Techn./Rochester Gen. Hosp. Biomedical Research & Programs Alliance, Sum. 2009
• Encyclopedia of Life/Dictyostelium discoideum proteome project with Prof. W. Loomis, UCSD Dept. of Biology, 2003-2004, and Drs. W. Li, G. Quinn & P. Bourne, SDSC, 2002-2006
• Bio 101 team-teaching project, directed by Prof. R. Pozos, SDSU, under IRACDA program, 2004-2008

(e) Collaborators & Other Affiliations:

(i) Collaborators:

• Paul Craig (RIT): Proteomics team-teaching, spring 2009 and 2010
• Lea Michel (RIT) & Michael Pichichero (RGH): vaccine development project, summer 2009

(ii) Graduate/Postdoctoral Advisors:

• Ph.D. Dissertation Advisor: Prof. John Abelson, Dept. of Biology, Caltech
• Postdoctoral Research Mentor: Prof. Joseph Kraut, Dept. of Chem. & Biochem., U.C. San Diego
• Postdoctoral Research Sponsors: Drs. F. Wong-Staal/R. Gallo (NCI/NIH); Dr. I. Wilson (TSRI); Drs. L. Brunton (UCSD)/R. Pozos (SDSU)/P. Bourne (SDSC)

(iii) Thesis Advisees & Mentees (Total: 9):

(1) Vrunda Sheth (M.S., graduated 2009; Applied Biosystems)
(2) Mark McCreary (M.S., RIT, current)
(3) Arkanjan Banerjee (M.S., RIT, current)
(4) Srujana Reddy Cheguri (M.S., RIT, current)
(5) Dong Jin Kim (M.S., RIT, current)
(6) Andrew Clark (M.S., RIT, current)
(7) Wan Munirah Wan Mohamad (B.S., RIT, current)
(8) Madolyn MacDonald (B.S., RIT, current)
(9) Muhamad Hanafi Hazemi (B.S., graduated 2010, RIT)

SUMMARY PROPOSAL BUDGET, YEAR 1

FOR NSF USE ONLY: PROPOSAL NO.; DURATION (months), Proposed / Granted; AWARD NO.
ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes

NSF Funded Person-months are listed as CAL / ACAD / SUMR; dollar amounts are Funds Requested By Proposer. No amounts are entered under Funds granted by NSF (if different).

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates (List each separately with title, A.7. show number in brackets)
   1. Vicente M Reyes - PI: 0.00 / 0.50 / 0.50; $7,663
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): 0.00 / 0.00 / 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): 0.00 / 0.50 / 0.50; $7,663
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: 11.76 / 0.00 / 0.00; $39,200
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): 6.00 / 0.00 / 0.00; $15,000
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $78,883
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $18,390
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $97,273
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   Terabyte Storage (3 terabytes of storage space): $3,000
   TOTAL EQUIPMENT: $3,000
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $12,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $5,600
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $0
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $7,711
H. TOTAL DIRECT COSTS (A THROUGH G): $119,984
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE): Modified Total Direct Costs (Rate: 44.5000, Base: 116984)
   TOTAL INDIRECT COSTS (F&A): $52,058
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $172,042
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $172,042
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

PI/PD NAME: Vicente M Reyes
ORG. REP. NAME*:
FOR NSF USE ONLY, INDIRECT COST RATE VERIFICATION: Date Checked; Date Of Rate Sheet; Initials - ORG
fm1030rs-07
*ELECTRONIC SIGNATURES REQUIRED FOR REVISED BUDGET

SUMMARY PROPOSAL BUDGET, YEAR 2

ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes

NSF Funded Person-months are listed as CAL / ACAD / SUMR; dollar amounts are Funds Requested By Proposer.

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates
   1. Vicente M Reyes - PI: 0.00 / 0.50 / 0.50; $7,893
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): 0.00 / 0.00 / 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): 0.00 / 0.50 / 0.50; $7,893
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: 11.76 / 0.00 / 0.00; $40,376
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): 6.00 / 0.00 / 0.00; $15,450
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $80,739
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $19,512
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $100,251
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   TOTAL EQUIPMENT: $0
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $3,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $0
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $2,000
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $4,111
H. TOTAL DIRECT COSTS (A THROUGH G): $107,362
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE): Modified Total Direct Costs (Rate: 44.5000, Base: 107362)
   TOTAL INDIRECT COSTS (F&A): $47,776
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $155,138
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $155,138
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

SUMMARY PROPOSAL BUDGET, YEAR 3

ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes

NSF Funded Person-months are listed as CAL / ACAD / SUMR; dollar amounts are Funds Requested By Proposer.

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates
   1. Vicente M Reyes - PI: 0.00 / 0.50 / 0.50; $8,130
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): 0.00 / 0.00 / 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): 0.00 / 0.50 / 0.50; $8,130
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: 11.76 / 0.00 / 0.00; $41,587
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): 6.00 / 0.00 / 0.00; $15,914
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $82,651
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $20,686
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $103,337
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   TOTAL EQUIPMENT: $0
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $10,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $0
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $2,000
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $4,111
H. TOTAL DIRECT COSTS (A THROUGH G): $117,448
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE): Modified Total Direct Costs (MTDC) (Rate: 44.5000, Base: 117448)
   TOTAL INDIRECT COSTS (F&A): $52,264
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $169,712
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $169,712
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

SUMMARY PROPOSAL BUDGET, CUMULATIVE

ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes

NSF Funded Person-months are listed as CAL / ACAD / SUMR; dollar amounts are Funds Requested By Proposer.

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates
   1. Vicente M Reyes - PI: 0.00 / 1.50 / 1.50; $23,686
   6. ( ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): 0.00 / 0.00 / 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): 0.00 / 1.50 / 1.50; $23,686
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 3 ) POST DOCTORAL SCHOLARS: 35.28 / 0.00 / 0.00; $121,163
   2. ( 3 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): 18.00 / 0.00 / 0.00; $46,364
   3. ( 12 ) GRADUATE STUDENTS: $34,560
   4. ( 6 ) UNDERGRADUATE STUDENTS: $16,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $242,273
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $58,588
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $300,861
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.): $3,000
   TOTAL EQUIPMENT: $3,000
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $25,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $5,600
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $4,000
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $6,333
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $15,933
H. TOTAL DIRECT COSTS (A THROUGH G): $344,794
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE)
   TOTAL INDIRECT COSTS (F&A): $152,098
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $496,892
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $496,892
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

Budget Justification

This project will begin on September 1, 2011, and end on August 31, 2014. All of the work for this project will take place at RIT.

Senior Personnel: Salaries and Wages

Dr. Vicente Reyes (PI), Assistant Professor of Biological Sciences, will devote 0.5 academic month of effort and 0.5 summer month of effort to the project per year. We request NSF funds for the salary associated with this effort. Dr. Reyes will devote additional effort as necessary, funded by RIT, to accomplish the goals of the project. Dr. Reyes will supervise the graduate and undergraduate students, as well as the post-doc and technician in his group. He will write any Fortran programs needed to implement the algorithms for his projects and have the students and post-doc test them on various datasets. He will also write the outlines for manuscripts and reviews for publication and, as part of their training, have his students and post-doc write the details.

Dr. Reyes works at RIT on a 9.5-month academic contract. The salary request for his effort in Year 1 is based upon his projected 2010-2011 annual base salary. Summer salary at RIT is calculated as the prior academic year's annual salary, multiplied by 26.3%, multiplied by the proportion of the summer that the faculty member is working on the project. For this proposal, the proportion of the summer that Dr. Reyes is working is 0.5/2.5 months. The salary request for Dr. Reyes has been incremented by 3% per year for inflation.

Other Personnel: Salaries and Wages

For each year of the project period, we request salary funds for a part-time technician/computer programmer (to be hired) who will devote 6 calendar months of effort per year to the project (starting salary: $15,000/year). The technician/programmer will be responsible for writing various scripts (UNIX C shell, Perl, etc.), creating web applications and web pages for the group, and maintaining our various databases, among other tasks.

In addition, we request funds for a full-time postdoctoral research associate (to be hired), with a starting salary of $40,000 per year. This postdoctoral research associate will devote 11.76 calendar months of effort to the project per year and will be responsible for starting novel research projects, writing pertinent reviews and papers for his/her own career development, and helping the undergraduate and graduate students in the group as needed. The salaries for the part-time technician/computer programmer and for the postdoctoral research associate have been incremented by 3% per year for inflation.

Furthermore, we request funds for 4 part-time bioinformatics master's students to each work 12 hours per week, at $12/hour, for 20 weeks per year ($11,520 per year). These students will have their own individual thesis projects, as assigned and outlined by the PI. Because of the students' part-time status at RIT and RIT's regulations, they will be hired as temporary employees for this role. Different students will be paid for their work on the project each year.

We request funds for an undergraduate student to work during each academic year for 10 hours/week, for 30 weeks, at $11/hour ($3,300 per year). This student will carry out a project in computational structural biology, the main area of the PI's research program. We also request funds for an undergraduate student to work during each summer for 20 hours/week, for 10 weeks, at $11/hour ($2,200 per year). This student will also work on a project in computational structural biology.
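The wage arithmetic above can be checked with a short Python sketch. It is only an illustration: the helper names are invented, and rounding each escalated salary to the nearest dollar (half up) before applying the next 3% increment is an assumption, although it reproduces the amounts on the budget forms.

```python
from decimal import Decimal, ROUND_HALF_UP


def hourly_total(people: int, hours_per_week: int, rate: int, weeks: int) -> int:
    """Annual wages for hourly workers, in whole dollars."""
    return people * hours_per_week * rate * weeks


def escalate(year1: int, rate: str = "0.03", years: int = 3) -> list:
    """Year-by-year amounts with a fixed annual increment (assumed half-up rounding)."""
    amounts = [year1]
    for _ in range(years - 1):
        grown = Decimal(amounts[-1]) * (1 + Decimal(rate))
        amounts.append(int(grown.quantize(Decimal("1"), rounding=ROUND_HALF_UP)))
    return amounts


grad_students = hourly_total(4, 12, 12, 20)     # 4 master's students
undergrad_acad = hourly_total(1, 10, 11, 30)    # academic-year undergraduate
undergrad_summer = hourly_total(1, 20, 11, 10)  # summer undergraduate

postdoc_year1 = 40000 * 1176 // 1200            # $40,000/year at 11.76 of 12 months
postdoc = escalate(postdoc_year1)
technician = escalate(15000)

print(grad_students, undergrad_acad + undergrad_summer)  # 11520 5500
print(postdoc)     # [39200, 40376, 41587]
print(technician)  # [15000, 15450, 15914]
```

The results match lines B.1 through B.4 of the Year 1 to Year 3 budget forms.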


Fringe Benefits

We request fringe benefits for Dr. Reyes for each academic year, at 29.6% for Y1, 30.6% for Y2, and 31.6% for Y3 (reflecting an escalation of 100 basis points per year), and at 7.9% per year for each summer. In addition, we request fringe benefits for the part-time technician/computer programmer and for the postdoctoral research associate, at 29.6% for Y1, 30.6% for Y2, and 31.6% for Y3 (reflecting an escalation of 100 basis points per year). We also request fringe benefits for the temporary employees at 7.9% per year. All fringe benefits are based upon RIT's federally negotiated benefit rates for work on federally sponsored projects (DHHS rate agreement, effective 07/01/09-06/30/10 and provisional until new rates are negotiated).

Capital Equipment

We request $3,000 in Year 1 for three terabytes of storage space, in order to store datasets, generated results and some necessary software.

Travel

We request $4,000 in funds for the PI to travel to Urbana-Champaign, IL to attend the Advanced Mathematics Workshop in Year 1. This will cover tuition, airfare, lodging, and meals. At this workshop, the PI will learn about specialized abstract mathematical tools available in Mathematica (including NKS, etc.) but not by other means, which will be applied to the PI's current and future projects in computational structural biology. A good example is the application of non-Euclidean geometries to the representation and analysis of macromolecular structures.

In addition, we request $2,000 in funds for the PI to attend the Gordon conference in Year 3 of the project period (including registration, transportation, lodging, and meals), in order to present the research results from his group and get feedback from his peers, as well as to share his knowledge and expertise by providing feedback on the results of other research groups.

Also, we request $1,500 in domestic travel funds per year for the PI to travel to a professional meeting, such as the computational biology and bioinformatics conferences sponsored by the International Society for Computational Biology (ISCB). This will include registration, transportation, lodging, and meals.

Furthermore, we request funds for the PI and for 4 master's students to attend the 17th SUNY Albany Conversation Conference during Year 1 and Year 3 of the project period ($1,000/person/trip x 5 travelers per year = $5,000 per year for Year 1 and Year 3). This will include registration, transportation, lodging, and meals.

We request $1,500 per year in domestic travel funds for the postdoctoral research associate to attend one professional meeting per year (e.g., computational biology and bioinformatics conferences sponsored by the ISCB).

Other Direct Costs

We request $1,000 for software (such as the most current versions of Mathematica and MATLAB) in Year 1. Also, we request $3,000 in Year 1 for RAM+CPU upgrades for computers used for the project. In addition, we request two graphics monitors (one for the PI's office, and one for use by his master's students on the project) at $800 each (total: $1,600) in Year 1. The graphics monitors will allow us to use our visualization tools for macromolecular 3D structures efficiently and with reasonable speed.
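The travel and Year 1 materials requests described above can be tallied against the budget forms with the following Python sketch. The dictionary keys are shorthand labels invented for this illustration; the dollar amounts are taken from the text.

```python
# Travel requests per project year, in dollars, as described in the Travel section.
TRAVEL = {
    1: {
        "Advanced Mathematics Workshop (PI)": 4000,
        "Professional meeting (PI)": 1500,
        "Albany Conversation (5 travelers)": 5 * 1000,
        "Professional meeting (postdoc)": 1500,
    },
    2: {
        "Professional meeting (PI)": 1500,
        "Professional meeting (postdoc)": 1500,
    },
    3: {
        "Gordon conference (PI)": 2000,
        "Professional meeting (PI)": 1500,
        "Albany Conversation (5 travelers)": 5 * 1000,
        "Professional meeting (postdoc)": 1500,
    },
}

travel_by_year = {year: sum(items.values()) for year, items in TRAVEL.items()}
travel_total = sum(travel_by_year.values())

# Year 1 materials and supplies: software, RAM+CPU upgrades, two graphics monitors.
materials_year1 = 1000 + 3000 + 2 * 800

print(travel_by_year, travel_total)  # {1: 12000, 2: 3000, 3: 10000} 25000
print(materials_year1)               # 5600
```

These sums agree with line E.1 (domestic travel) of each yearly form, the cumulative $25,000, and line G.1 (materials and supplies) for Year 1.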


Furthermore, we request $2,000 per year in Years 2 and 3 for journal page charges, so that the PI can publish and disseminate the project results.

We request Information Technology and Service (ITS) charges at RIT's rate on federally sponsored projects of $88.70/FTE/month. This covers services such as the maintenance of email accounts. ITS charges do not apply to faculty summer salary.

Indirect (Facilities and Administrative) Costs

RIT's federally negotiated indirect cost rate (U.S. Department of Health and Human Services, effective 07/01/09 and forward until a new rate is negotiated) is 44.5% of modified total direct costs (total direct costs - equipment over $1500 - Participant Support Costs - Subawards in Excess of $25,000). A copy of RIT's indirect cost rate agreement is available upon request.

Total Request to NSF: $496,892
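As a worked check of the indirect cost calculation, the Python sketch below applies the 44.5% rate to the modified total direct cost base for each year. The total direct cost and equipment figures come from the budget forms; rounding to the nearest dollar (half up) and the function name are assumptions made for this illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE = Decimal("0.445")  # 44.5% of modified total direct costs (MTDC)

# (total direct costs, equipment excluded from the MTDC base) per year, from the forms.
DIRECT = {1: (119984, 3000), 2: (107362, 0), 3: (117448, 0)}


def indirect_cost(total_direct: int, excluded: int) -> int:
    """F&A cost on the MTDC base, rounded to the nearest dollar (assumed half up)."""
    base = Decimal(total_direct - excluded)
    return int((base * RATE).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


indirect_by_year = {year: indirect_cost(*costs) for year, costs in DIRECT.items()}
total_by_year = {year: DIRECT[year][0] + indirect_by_year[year] for year in DIRECT}
grand_total = sum(total_by_year.values())

print(indirect_by_year)  # {1: 52058, 2: 47776, 3: 52264}
print(total_by_year)     # {1: 172042, 2: 155138, 3: 169712}
print(grand_total)       # 496892
```

The yearly totals and the grand total agree with lines I, J and L of the budget forms and with the total request of $496,892.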


Current and Pending Support

(See GPG Section II.C.2.h for guidance on information to include on this form.) The following information should be provided for each investigator and other senior personnel. Failure to provide this information may delay consideration of this proposal.

Investigator: Vicente Reyes
Other agencies (including NSF) to which this proposal has been/will be submitted:

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: F.E.A.D. Summer Salary Award
Source of Support: College of Science, RIT
Total Award Amount: $3,000
Total Award Period Covered: 07/01/10 - 08/31/10
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00 Acad: 0.00 Sumr: 0.50

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: ABI Innovation: Use of a Reduced Protein Representation for the Modeling and Screening of Ligand Binding Sites: A Structure-Based Protein Function Prediction Method
Source of Support: NSF
Total Award Amount: $496,893
Total Award Period Covered: 09/01/11 - 08/31/14
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00 Acad: 0.50 Sumr: 0.50

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: ABI Innovation: Use of Spherical and Cylindrical Coordinate Systems to Represent Protein 3D Structures: Applications to Epitope Mapping and Ligand Binding Site Prediction
Source of Support: NSF
Total Award Amount: $496,893
Total Award Period Covered: 09/01/11 - 08/31/14
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00 Acad: 0.50 Sumr: 0.50

Current

Pending

Submission Planned in Near Future

*Transfer of Support

Project/Proposal Title:

Source of Support: Total Award Amount: $ Total Award Period Covered: Location of Project: Person-Months Per Year Committed to the Project. Cal: Acad: Support:

Current

Pending

Submission Planned in Near Future

Sumr: *Transfer of Support

Project/Proposal Title:

Source of Support: Total Award Amount: $ Total Award Period Covered: Location of Project: Person-Months Per Year Committed to the Project. Cal: Acad:

Summ:

*If this project has previously been funded by another agency, please list and furnish information for immediately preceding funding period.


Facilities, Equipment and Other Resources.

Faculty of the Bioinformatics Program at RIT have exclusive access to a variety of computational resources. These include a number of dedicated computing systems, such as several Sun Enterprise 450 servers: one has 4 GB of RAM and 432 GB of hard drive storage space, and two others have 4 GB of RAM and approximately 1 TB of internal storage. Additionally, we have two Dell PowerEdge 1850 servers with dual dual-core Xeon processors, 4 GB of RAM and 600 GB of internal RAID 0 storage; one Dell PowerEdge 1950 with dual dual-core Xeon processors, 4 GB of RAM and 300 GB of storage; and one SunFire 2100 dual-core Opteron machine with 2 GB of RAM and 160 GB of internal storage, which is a dedicated BLAST and e-mail server. That network is attached to an additional 9.1 TB of network-attached storage (NAS) for database housing and maintenance, with a further 3.6 TB of storage located in a separate facility. We also have one machine dedicated to monitoring network and server health and one serving as a bridge firewall.

RIT has multiple means of external network connectivity, including 400 Mb/s and 200 Mb/s Internet2 connections routed via Gigabit Ethernet and a 45 Mb/s T3 connection for backup. RIT owns and operates its own Dense Wavelength Division Multiplexing (DWDM) network for connecting the Gigabit Ethernet connections to our ISPs. The DWDM network (as configured) can carry 32 10 Gb/s channels (lambdas) to points of interest from the RIT community. Wireless connectivity is ubiquitous throughout campus. RIT also has access to the program SAS through a site license.

The author also has access to various software and approximately 10 TB of data storage at the department level (http://www.adobo.bioinformatics.rit.edu), and to further computing resources and storage space at RIT's I.T. Collaboratory (http://www.rit.edu/research/itc/) and Research Computing (http://rc.rit.edu/).

RIT’s office of Sponsored Programs Accounting will handle project accounting and budgetary procedures. The budget will be monitored via RIT’s Oracle System, which tracks all expenses. RIT’s office of Sponsored Research Services will manage grant reporting.


Postdoctoral Mentoring Plan: To ensure that the postdoctoral researcher mentored by the PI is of high quality and fully capable of becoming an integral part of the next generation of independent and multidisciplinary scientific researchers, whether in academia, government or industry, the following steps will be taken:

• he/she will be encouraged to sit in on or audit courses at RIT related to computational biology or bioinformatics in which he/she has little or no background;
• he/she will be required to disseminate his/her scientific results in peer-reviewed scientific journals and conferences;
• he/she will be asked to actively participate in the writing/preparation of grant proposals by the PI;
• he/she will be encouraged to take advantage of career counseling services offered by RIT;
• he/she will be asked occasionally to advise/guide research projects of graduate and undergraduate students in the group;
• he/she will be strongly encouraged to attend courses and/or workshops in scientific and professional ethics;
• he/she will be trained in the scientific method and way of thinking through extensive practice in posing the right questions when solving an open, broad scientific problem.

The office of Sponsored Research Services at RIT conducts extensive organized training for principal investigators and other personnel involved in externally sponsored projects. In the 2009-2010 academic year, sponsored research staff organized over fifty hours of training addressing topics including peer review, funding agency overviews, compliance, intellectual property, budgeting and others. These sessions were attended by 250 individuals. Post-docs are strongly encouraged to attend these training events.

Additionally, Sponsored Research Services and the office of Teaching and Learning Services at RIT sponsor and run an annual Grant Writers’ Boot Camp, an intensive two-day session in the fundamentals of grant writing and peer review. Participants, including post-docs, come prepared with proposals for internal seed funding awards that are reviewed and revised over the course of the program. The Principal Investigator will work with the post-doc on an individualized professional development plan to take advantage of these and other services at RIT.

02 INFORMATION ABOUT PRINCIPAL INVESTIGATORS/PROJECT DIRECTORS (PI/PD) and co-PRINCIPAL INVESTIGATORS/co-PROJECT DIRECTORS

Submit only ONE copy of this form for each PI/PD and co-PI/PD identified on the proposal. The form(s) should be attached to the original proposal as specified in GPG Section II.C.a. Submission of this information is voluntary and is not a precondition of award. This information will not be disclosed to external peer reviewers. DO NOT INCLUDE THIS FORM WITH ANY OF THE OTHER COPIES OF YOUR PROPOSAL AS THIS MAY COMPROMISE THE CONFIDENTIALITY OF THE INFORMATION.

PI/PD Name: Vicente M Reyes

Gender: Male; Female

Ethnicity (Choose one response): Hispanic or Latino; Not Hispanic or Latino

Race (Select one or more): American Indian or Alaska Native; Asian; Black or African American; Native Hawaiian or Other Pacific Islander; White

Disability Status (Select one or more): Hearing Impairment; Visual Impairment; Mobility/Orthopedic Impairment; Other; None

Citizenship (Choose one): U.S. Citizen; Permanent Resident; Other non-U.S. Citizen

Check here if you do not wish to provide any or all of the above information (excluding PI/PD name):

REQUIRED: Check here if you are currently serving (or have previously served) as a PI, co-PI or PD on any federally funded project

Ethnicity Definition:
Hispanic or Latino. A person of Mexican, Puerto Rican, Cuban, South or Central American, or other Spanish culture or origin, regardless of race.

Race Definitions:
American Indian or Alaska Native. A person having origins in any of the original peoples of North and South America (including Central America), and who maintains tribal affiliation or community attachment.
Asian. A person having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam.
Black or African American. A person having origins in any of the black racial groups of Africa.
Native Hawaiian or Other Pacific Islander. A person having origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
White. A person having origins in any of the original peoples of Europe, the Middle East, or North Africa.

WHY THIS INFORMATION IS BEING REQUESTED: The Federal Government has a continuing commitment to monitor the operation of its review and award processes to identify and address any inequities based on gender, race, ethnicity, or disability of its proposed PIs/PDs. To gather information needed for this important task, the proposer should submit a single copy of this form for each identified PI/PD with each proposal. Submission of the requested information is voluntary and will not affect the organization’s eligibility for an award. However, information not submitted will seriously undermine the statistical validity, and therefore the usefulness, of information received from others.

Any individual not wishing to submit some or all of the information should check the box provided for this purpose. (The exceptions are the PI/PD name and the information about prior Federal support, the last question above.)

Collection of this information is authorized by the NSF Act of 1950, as amended, 42 U.S.C. 1861, et seq. Demographic data allows NSF to gauge whether our programs and other opportunities in science and technology are fairly reaching and benefiting everyone regardless of demographic category; to ensure that those in under-represented groups have the same knowledge of and access to programs and other research and educational opportunities; and to assess involvement of international investigators in work supported by NSF. The information may be disclosed to government contractors, experts, volunteers and researchers to complete assigned work; and to other government agencies in order to coordinate and assess programs. The information may be added to the Reviewer file and used to select potential candidates to serve as peer reviewers or advisory committee members. See Systems of Records, NSF-50, "Principal Investigator/Proposal File and Associated Records", 63 Federal Register 267 (January 5, 1998), and NSF-51, "Reviewer/Proposal File and Associated Records", 63 Federal Register 268 (January 5, 1998).

COVER SHEET FOR PROPOSAL TO THE NATIONAL SCIENCE FOUNDATION

PROGRAM ANNOUNCEMENT/SOLICITATION NO./CLOSING DATE (if not in response to a program announcement/solicitation, enter NSF 10-1): NSF 10-567, 08/23/10

FOR NSF USE ONLY: NSF PROPOSAL NUMBER; DATE RECEIVED; NUMBER OF COPIES; DIVISION ASSIGNED; FUND CODE; FILE LOCATION

FOR CONSIDERATION BY NSF ORGANIZATION UNIT(S) (Indicate the most specific unit known, i.e. program, division, etc.): DBI - ADVANCES IN BIO INFORMATICS

DUNS# (Data Universal Numbering System): 002223642

EMPLOYER IDENTIFICATION NUMBER (EIN) OR TAXPAYER IDENTIFICATION NUMBER (TIN): 160743140

SHOW PREVIOUS AWARD NO. IF THIS IS: A RENEWAL; AN ACCOMPLISHMENT-BASED RENEWAL

IS THIS PROPOSAL BEING SUBMITTED TO ANOTHER FEDERAL AGENCY? YES; NO. IF YES, LIST ACRONYM(S)

NAME OF ORGANIZATION TO WHICH AWARD SHOULD BE MADE: Rochester Institute of Tech

ADDRESS OF AWARDEE ORGANIZATION, INCLUDING 9 DIGIT ZIP CODE: Rochester Institute of Tech, 1 Lomb Memorial Drive, Rochester, NY 146235603

AWARDEE ORGANIZATION CODE (IF KNOWN): 0028068000

NAME OF PERFORMING ORGANIZATION, IF DIFFERENT FROM ABOVE:

ADDRESS OF PERFORMING ORGANIZATION, IF DIFFERENT, INCLUDING 9 DIGIT ZIP CODE:

PERFORMING ORGANIZATION CODE (IF KNOWN):

IS AWARDEE ORGANIZATION (Check All That Apply) (See GPG II.C For Definitions): SMALL BUSINESS; FOR-PROFIT ORGANIZATION; MINORITY BUSINESS; WOMAN-OWNED BUSINESS

IF THIS IS A PRELIMINARY PROPOSAL THEN CHECK HERE

TITLE OF PROPOSED PROJECT: ABI Innovation: Use of Spherical and Cylindrical Coordinate Systems to Represent Protein 3D Structures: Applications to Epitope Mapping and Ligand Binding Site Prediction

REQUESTED AMOUNT: $496,892

PROPOSED DURATION (1-60 MONTHS): 36 months

REQUESTED STARTING DATE: 09/01/11

SHOW RELATED PRELIMINARY PROPOSAL NO. IF APPLICABLE:

CHECK APPROPRIATE BOX(ES) IF THIS PROPOSAL INCLUDES ANY OF THE ITEMS LISTED BELOW
BEGINNING INVESTIGATOR (GPG I.G.2)
DISCLOSURE OF LOBBYING ACTIVITIES (GPG II.C.1.e)
PROPRIETARY & PRIVILEGED INFORMATION (GPG I.D, II.C.1.d)
HISTORIC PLACES (GPG II.C.2.j)
EAGER* (GPG II.D.2)
RAPID** (GPG II.D.1)
VERTEBRATE ANIMALS (GPG II.D.6) IACUC App. Date; PHS Animal Welfare Assurance Number
HUMAN SUBJECTS (GPG II.D.7) Human Subjects Assurance Number; Exemption Subsection or IRB App. Date
INTERNATIONAL COOPERATIVE ACTIVITIES: COUNTRY/COUNTRIES INVOLVED (GPG II.C.2.j)
HIGH RESOLUTION GRAPHICS/OTHER GRAPHICS WHERE EXACT COLOR REPRESENTATION IS REQUIRED FOR PROPER INTERPRETATION (GPG I.G.1)

PI/PD DEPARTMENT: Biological Sciences

PI/PD POSTAL ADDRESS: 1 LOMB MEMORIAL DR, ROCHESTER, NY 146235603, United States

PI/PD FAX NUMBER:

NAMES (TYPED), High Degree, Yr of Degree, Telephone Number, Electronic Mail Address:
PI/PD NAME: Vicente M Reyes, PhD, 1988, 585-475-4115, [email protected]
CO-PI/PD:
CO-PI/PD:
CO-PI/PD:
CO-PI/PD:

CERTIFICATION PAGE Certification for Authorized Organizational Representative or Individual Applicant: By signing and submitting this proposal, the Authorized Organizational Representative or Individual Applicant is: (1) certifying that statements made herein are true and complete to the best of his/her knowledge; and (2) agreeing to accept the obligation to comply with NSF award terms and conditions if an award is made as a result of this application. Further, the applicant is hereby providing certifications regarding debarment and suspension, drug-free workplace, lobbying activities (see below), responsible conduct of research, nondiscrimination, and flood hazard insurance (when applicable) as set forth in the NSF Proposal & Award Policies & Procedures Guide, Part I: the Grant Proposal Guide (GPG) (NSF 10-1). Willful provision of false information in this application and its supporting documents or in reports required under an ensuing award is a criminal offense (U. S. Code, Title 18, Section 1001).

Conflict of Interest Certification In addition, if the applicant institution employs more than fifty persons, by electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative of the applicant institution is certifying that the institution has implemented a written and enforced conflict of interest policy that is consistent with the provisions of the NSF Proposal & Award Policies & Procedures Guide, Part II, Award & Administration Guide (AAG) Chapter IV.A; that to the best of his/her knowledge, all financial disclosures required by that conflict of interest policy have been made; and that all identified conflicts of interest will have been satisfactorily managed, reduced or eliminated prior to the institution’s expenditure of any funds under the award, in accordance with the institution’s conflict of interest policy. Conflicts which cannot be satisfactorily managed, reduced or eliminated must be disclosed to NSF.

Drug Free Work Place Certification By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Drug Free Work Place Certification contained in Exhibit II-3 of the Grant Proposal Guide.

Debarment and Suspension Certification

Is the organization or its principals presently debarred, suspended, proposed for debarment, declared ineligible, or voluntarily excluded from covered transactions by any Federal department or agency? Yes; No (If answer "yes", please provide explanation.)

By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant is providing the Debarment and Suspension Certification contained in Exhibit II-4 of the Grant Proposal Guide.

Certification Regarding Lobbying The following certification is required for an award of a Federal contract, grant, or cooperative agreement exceeding $100,000 and for an award of a Federal loan or a commitment providing for the United States to insure or guarantee a loan exceeding $150,000.

Certification for Contracts, Grants, Loans and Cooperative Agreements The undersigned certifies, to the best of his or her knowledge and belief, that: (1) No federal appropriated funds have been paid or will be paid, by or on behalf of the undersigned, to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with the awarding of any federal contract, the making of any Federal grant, the making of any Federal loan, the entering into of any cooperative agreement, and the extension, continuation, renewal, amendment, or modification of any Federal contract, grant, loan, or cooperative agreement. (2) If any funds other than Federal appropriated funds have been paid or will be paid to any person for influencing or attempting to influence an officer or employee of any agency, a Member of Congress, an officer or employee of Congress, or an employee of a Member of Congress in connection with this Federal contract, grant, loan, or cooperative agreement, the undersigned shall complete and submit Standard Form-LLL, ‘‘Disclosure of Lobbying Activities,’’ in accordance with its instructions. (3) The undersigned shall require that the language of this certification be included in the award documents for all subawards at all tiers including subcontracts, subgrants, and contracts under grants, loans, and cooperative agreements and that all subrecipients shall certify and disclose accordingly. This certification is a material representation of fact upon which reliance was placed when this transaction was made or entered into. Submission of this certification is a prerequisite for making or entering into this transaction imposed by section 1352, Title 31, U.S. Code. Any person who fails to file the required certification shall be subject to a civil penalty of not less than $10,000 and not more than $100,000 for each such failure.

Certification Regarding Nondiscrimination By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative is providing the Certification Regarding Nondiscrimination contained in Exhibit II-6 of the Grant Proposal Guide.

Certification Regarding Flood Hazard Insurance Two sections of the National Flood Insurance Act of 1968 (42 USC §4012a and §4106) bar Federal agencies from giving financial assistance for acquisition or construction purposes in any area identified by the Federal Emergency Management Agency (FEMA) as having special flood hazards unless the: (1) community in which that area is located participates in the national flood insurance program; and (2) building (and any related equipment) is covered by adequate flood insurance. By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative or Individual Applicant located in FEMA-designated special flood hazard areas is certifying that adequate flood insurance has been or will be obtained in the following situations: (1) for NSF grants for the construction of a building or facility, regardless of the dollar amount of the grant; and (2) for other NSF Grants when more than $25,000 has been budgeted in the proposal for repair, alteration or improvement (construction) of a building or facility.

Certification Regarding Responsible Conduct of Research (RCR) (This certification is not applicable to proposals for conferences, symposia, and workshops.) By electronically signing the NSF Proposal Cover Sheet, the Authorized Organizational Representative of the applicant institution is certifying that, in accordance with the NSF Proposal & Award Policies & Procedures Guide, Part II, Award & Administration Guide (AAG) Chapter IV.B., the institution has a plan in place to provide appropriate training and oversight in the responsible and ethical conduct of research to undergraduates, graduate students and postdoctoral researchers who will be supported by NSF to conduct research. The undersigned shall require that the language of this certification be included in any award documents for all subawards at all tiers.

AUTHORIZED ORGANIZATIONAL REPRESENTATIVE

SIGNATURE:
DATE:
NAME:
TELEPHONE NUMBER:
ELECTRONIC MAIL ADDRESS:
FAX NUMBER:

* EAGER - EArly-concept Grants for Exploratory Research
** RAPID - Grants for Rapid Response Research

Directorate for Biological Sciences
Division of Biological Infrastructure
Advances in Bio Informatics

Proposal Classification Form

PI: Reyes, Vicente

CATEGORY I: INVESTIGATOR STATUS (Select ONE)
Beginning Investigator - No previous Federal support as PI or Co-PI, excluding fellowships, dissertations, planning grants, etc.
Prior Federal support only
Current Federal support only
Current & prior Federal support

CATEGORY II: FIELDS OF SCIENCE OTHER THAN BIOLOGY INVOLVED IN THIS RESEARCH (Select 1 to 3) Astronomy Chemistry Computer Science Earth Science

Engineering Mathematics Physics

Psychology Social Sciences None of the Above

CATEGORY III: SUBSTANTIVE AREA (Select 1 to 4) BIOMATERIALS BIOTECHNOLOGY Animal Biotechnology Plant Biotechnology Environmental Biotechnology Marine Biotechnology Metabolic Engineering CHROMOSOME STUDIES COMMUNITY ECOLOGY COMPUTATIONAL BIOLOGY CONSERVATION & RESTORATION BIOLOGY CORAL REEFS CURATION DATABASES ECOSYSTEMS LEVEL GENOMICS (Genome sequence, organization, function)

Viral Microbial Fungal Plant Animal

INFORMATICS MARINE MAMMALS Molecular Evolution Methodology/Theory Gene/Genome Mapping Natural Products NANOSCIENCE PHOTOSYNTHESIS PLANT BIOLOGY Arabidopsis-Related Plant Research POPULATION DYNAMICS & LIFE HISTORY

POPULATION GENETICS & BREEDING SYSTEMS REPRODUCTIVE ANIMAL BIOLOGY Plant Pathology Coevolution Biological Control STATISTICS & MODELING Methods/ Instrumentation/ Software Modeling (general) Modeling of Biological or Molecular Systems Computational Modeling

Statistics (general) STRUCTURAL BIOLOGY SYSTEMATICS Phenetics/Cladistics/ Numerical Taxonomy NONE OF THE ABOVE

CATEGORY IV: INFRASTRUCTURE (Select 1 to 3) COLLECTIONS/STOCK CULTURES Collection Enhancement Collection Refurbishment Living Organism Stock Cultures Natural History Collections DATABASES Database Initiation

Database Enhancement Database Maintenance & Curation Database Methods FACILITIES Controlled Environment Facilities Field Stations Field Facility Structure

Field Facility Equipment

LTER Site GENOME SEQUENCING Other Plant Genome Sequencing

INDUSTRY PARTICIPATION INSTRUMENTATION Instrument Development


Instrument Acquisition Computational Hardware Development/Acquisition TOOLS DEVELOPMENT Analytical Algorithm Development Other Software Development Informatics Tool Development

Technique Development TRACKING SYSTEMS Geographic Information Systems Remote Sensing TRAINING

Multi-, Cross-, Interdisciplinary Training Undergraduate Training Predoctoral Training Postdoctoral Training NONE OF THE ABOVE

CATEGORY V: HABITAT (No selection required)

CATEGORY VI: GEOGRAPHIC AREA OF THE RESEARCH (No selection required)

CATEGORY VII: CLASSIFICATION OF ORGANISMS (Select 1 to 4)

VIRUSES: Bacterial; Plant; Animal

PROKARYOTES: Archaebacteria; Cyanobacteria; Eubacteria

PLANTS: NON-VASCULAR PLANTS; VASCULAR PLANTS; GYMNOSPERMS; ANGIOSPERMS (Monocots; Dicots)

PROTISTA (PROTOZOA) FUNGI LICHENS SLIME MOLDS ALGAE

ANIMALS INVERTEBRATES ARTHROPODA Hexapoda (Insecta) (Insects) VERTEBRATES FISHES

Chondrichthyes (Cartilaginous Fishes) (Sharks, Rays, Ratfish) Osteichthyes (Bony Fishes) AMPHIBIA REPTILIA AVES (Birds) MAMMALIA Primates Humans Rodentia Marine Mammals (Seals, Walrus, Whales, Otters, Dolphins, Porpoises)

TRANSGENIC ORGANISMS NO ORGANISMS

CATEGORY VIII: MODEL ORGANISM (Select ONE) NO MODEL ORGANISM MODEL ORGANISM (Choose from the list or input up to 9 characters)

FUNGAL PLANT Mouse-Ear Cress (Arabidopsis thaliana)

Fruitfly (Drosophila melanogaster)

[Enter your own model organism - up to 9 characters]

Escherichia coli


Abstract. Three-dimensional objects can be represented using Cartesian, spherical or cylindrical coordinate systems, among others. Currently, all protein 3D structures in the PDB are in Cartesian coordinates. Can a transformation of coordinates enable applications that would otherwise be too cumbersome or impractical in Cartesian coordinates? We wrote a Fortran program to transform protein 3D structure files from Cartesian coordinates (x,y,z) to spherical coordinates (ρ,φ,θ), with the centroid of the protein molecule as the origin. Here we propose, and present preliminary results for, two applications of this coordinate transformation: (1) separation of the protein outer layer (OL) from the inner core (IC); and (2) identification of protrusions and invaginations on the protein surface. In the first application, φ and θ are partitioned into suitable intervals and the point with maximum ρ in each such 'φ-θ bin' is determined. A suitable cutoff value for ρ is then chosen, and within each φ-θ bin all points with ρ values less than the cutoff are considered part of the IC, and those with ρ values equal to or greater than the cutoff are considered part of the OL. We show that this separation procedure is successful: except for membrane proteins, it yields an OL that is significantly more enriched in hydrophilic amino acid residues and an IC that is significantly more enriched in hydrophobic amino acid residues, as expected. We propose epitope mapping as a particular application of this method, in which clusters of points in the OL are detected by plotting φ-θ bin point densities against φ and θ. In the second application, the points with maximum ρ in each φ-θ bin are collected and their frequency distribution is constructed (i.e., the maximum ρ values are sorted from lowest to highest and grouped into 1.50 Å intervals, and the frequency in each interval is plotted).
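The steps described so far can be illustrated with a minimal Python sketch. This is not the Fortran program mentioned above, and several details are assumptions made only for illustration: φ is taken as the polar angle from +z (0 to 180 degrees) and θ as the azimuth (0 to 360 degrees); the bin width defaults to 10 degrees; and the cutoff in each bin is taken as a fraction (`shell`) of that bin's maximum ρ, since the text does not specify how the cutoff is chosen. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def to_spherical(points):
    """Convert (x, y, z) points to (rho, phi, theta) about their centroid."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    out = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        rho = math.sqrt(dx * dx + dy * dy + dz * dz)
        # Clamp the cosine to guard against rounding slightly outside [-1, 1].
        cos_phi = max(-1.0, min(1.0, dz / rho)) if rho > 0 else 1.0
        phi = math.degrees(math.acos(cos_phi))
        theta = math.degrees(math.atan2(dy, dx)) % 360.0
        out.append((rho, phi, theta))
    return out


def bin_key(phi, theta, step):
    """Index of the phi-theta bin containing the given angles (degrees)."""
    n_phi = math.ceil(180.0 / step)
    n_theta = math.ceil(360.0 / step)
    return (min(int(phi // step), n_phi - 1),
            min(int(theta // step), n_theta - 1))


def split_layers(points, step=10.0, shell=0.8):
    """Return (outer indices, inner indices, max rho per bin).

    Assumed cutoff rule: a point is in the outer layer if its rho is at
    least `shell` times the maximum rho found in its phi-theta bin.
    """
    sph = to_spherical(points)
    max_rho = defaultdict(float)
    for rho, phi, theta in sph:
        key = bin_key(phi, theta, step)
        max_rho[key] = max(max_rho[key], rho)
    outer, inner = [], []
    for idx, (rho, phi, theta) in enumerate(sph):
        cutoff = shell * max_rho[bin_key(phi, theta, step)]
        (outer if rho >= cutoff else inner).append(idx)
    return outer, inner, dict(max_rho)


def max_rho_histogram(max_rho, width=1.5):
    """Frequency of per-bin maximum rho values in intervals of `width`."""
    counts = defaultdict(int)
    for rho in max_rho.values():
        counts[int(rho // width)] += 1
    return [(k * width, counts[k]) for k in sorted(counts)]
```

A typical use would pass a list of atom or residue coordinates to `split_layers`, then feed the returned per-bin maxima to `max_rho_histogram` to obtain the frequency distribution used in the second application.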
We show in such plots that invaginations on the protein surface give rise to subpeaks on the lagging side of the main peak, while protrusions give rise to similar subpeaks, but on the leading side of the main peak. We propose that a particular application of this method is the prediction of ligand binding sites on the protein surface. The dataset of Laskowski et al. (1996) was used to demonstrate both applications. We are now working to extend our method to the use of cylindrical coordinates (r,θ,z) to represent rod-shaped proteins and viruses. We have written the requisite Fortran code, which performs the transformation given the Cartesian coordinates of the two extreme points ("tips") of the cylindrical structure. Finally, we propose to create databases and web servers that will display, store and transform input protein PDB structures into the two aforementioned coordinate systems.

Intellectual Merit. This work offers considerable opportunity to advance knowledge across different domains, since it combines mathematics, computing, molecular biology and proteomics. The PI is a multidisciplinary researcher with formal training in mathematics, chemistry, biochemistry, molecular biology, statistics, data mining, bioinformatics and computational biology. We propose to represent protein structures in spherical and cylindrical coordinates as appropriate (i.e., for globular and rod-shaped macromolecules, respectively); to the best of our knowledge, this will be the first time protein structure is represented in a coordinate system other than Cartesian. The seminal ideas underlying this project were conceived in early 2007 while the PI was a post-doc at UCSD; they were carefully developed and refined over the next couple of years until he joined the RIT faculty in late 2008, and up to the present. The PI has access to various software and approximately 10 TB of data storage at the department level, and to considerably more computing resources and disk space at RIT's I.T. Collaboratory and Research Computing Department.

Broader Impact. The PI teaches courses in proteomics and bioinformatics and has initiated a research-based teaching method in order to advance discovery while promoting teaching, training and learning. In this method, the students are given a "freestyle" laboratory exercise involving an open problem in the field. He will present the successful results of this innovative teaching technique in a video clip at ICERI 2010 (International Conference on Education, Research and Innovation). Our institute is the home of the National Technical Institute for the Deaf (NTID), and as such we have the unique opportunity to increase the participation of this underrepresented group in our research programs. Our institute also serves many talented minority and foreign students. Here we propose to create databases for protein structures expressed in spherical coordinates (globular proteins) or cylindrical coordinates (rod-shaped proteins and viruses), thereby enhancing infrastructure for research and education. The PI, along with a post-doc, a technician and graduate students, plans to participate actively in national as well as international meetings in order to disseminate their work. These activities will be the prelude to the publication of their findings in primary, peer-reviewed scientific journals. One application of this work, which we demonstrate here, is computational epitope mapping, which can have a significant impact on vaccine design, a societal benefit. A post-doc, a technician, and undergraduate and graduate students will all be trained in the scientific method and way of thinking, so that they can become independent scientific researchers in the future.
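As a supplementary illustration of the cylindrical representation mentioned in the abstract, the following minimal Python sketch converts Cartesian points to (r, θ, z) given the two tips of a rod-shaped structure. It is not the Fortran code referred to above. The choices of origin (the first tip), axis direction (first tip to second tip), and the reference direction defining θ = 0 are assumptions made only for illustration, and all names are hypothetical.

```python
import math


def _dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def to_cylindrical(points, tip_a, tip_b):
    """Return (r, theta_deg, z) per point, with the z axis from tip_a to tip_b."""
    axis = [b - a for a, b in zip(tip_a, tip_b)]
    length = math.sqrt(_dot(axis, axis))
    if length == 0:
        raise ValueError("the two tips must be distinct points")
    w = [c / length for c in axis]  # unit vector along the rod axis
    # Arbitrary reference direction that is not parallel to the axis.
    ref = (1.0, 0.0, 0.0) if abs(w[0]) < 0.9 else (0.0, 1.0, 0.0)
    # u: component of ref perpendicular to w, normalized (theta = 0 direction).
    d = _dot(ref, w)
    u = [r - d * c for r, c in zip(ref, w)]
    norm_u = math.sqrt(_dot(u, u))
    u = [c / norm_u for c in u]
    # v = w x u completes a right-handed frame (u, v, w).
    v = [w[1] * u[2] - w[2] * u[1],
         w[2] * u[0] - w[0] * u[2],
         w[0] * u[1] - w[1] * u[0]]
    out = []
    for p in points:
        q = [pc - a for pc, a in zip(p, tip_a)]  # position relative to tip_a
        z = _dot(q, w)
        pu, pv = _dot(q, u), _dot(q, v)
        r = math.hypot(pu, pv)
        theta = math.degrees(math.atan2(pv, pu)) % 360.0
        out.append((r, theta, z))
    return out
```

With the axis along +z and the first tip at the origin, this reduces to the usual textbook conversion, which is a convenient sanity check before applying it to an arbitrarily oriented structure.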

TABLE OF CONTENTS
For font size and page formatting specifications, see GPG section II.B.2.

Section: Total No. of Pages (Page No.*, optional, not filled in)

Cover Sheet for Proposal to the National Science Foundation
Project Summary (not to exceed 1 page): 1
Table of Contents: 1
Project Description (Including Results from Prior NSF Support) (not to exceed 15 pages) (Exceed only if allowed by a specific program announcement/solicitation or if approved in advance by the appropriate NSF Assistant Director or designee): 15
References Cited: 2
Biographical Sketches (Not to exceed 2 pages each): 2
Budget (Plus up to 3 pages of budget justification): 7
Current and Pending Support: 1
Facilities, Equipment and Other Resources: 1
Special Information/Other Supplementary Docs/Mentoring Plan: 1
Appendix (List below.) (Include only if allowed by a specific program announcement/solicitation or if approved in advance by the appropriate NSF Assistant Director or designee)
Appendix Items:

*Proposers may select any numbering mechanism for the proposal. The entire proposal however, must be paginated. Complete both columns only if the proposal is numbered consecutively.

Use of Spherical and Cylindrical Coordinate Systems to Represent Protein 3D Structures: Applications to Epitope Mapping and Ligand Binding Site Prediction

Submitted by Vicente M. Reyes, Ph.D. [e-mail: [email protected]]

TABLE OF CONTENTS:

I.
A. Introduction & Background
B. Specific Objectives and Statement of Work
C. General Plan and Design of Work
D. Relationship of Work to Long-term Goals of PI
E. Relationship of Work to State of Knowledge in the Field

II. Experimental Methods and Procedures
A. Application #1: OL -- IC Separation
1. Conversion of AAR to DCRR
2. Conversion of Cartesian to Spherical Coordinates
3. Binning the spherical representation of the protein
4. Separation of the inner core (IC) and the outer layer (OL)
5. Using the OL to predict candidate epitopes
6. Using the IC to find possible protein functional sites
B. Application #2: Surface Topography
1. An Artificial Protein
2. Frequency Distribution of Maximum Rhos

III. Preliminary Results
A. Application #1: OL -- IC Separation
1. Amino Acid Compositions of OL and IC
2. Finding Candidate Epitopes
3. Results for prediction of buried active sites
4. Refinement Tests for Epitope Prediction: Fine vs. Coarse Binning and AAR vs. DCRR
B. Application #2: Surface Topography
1. Use of Artificial Protein to Identify Invaginations & Protrusions on Protein Surface
2. Application to Real Proteins in the Laskowski Data Set

IV. Concluding Remarks
A. Conclusions and Future Directions
B. Preservation/Documentation/Sharing of Data & Related Research/Education Products

V. Broader Impact

VI. Management Plan

VII. Miscellaneous Items
A. Keywords
B. Abbreviations
C. Definitions

VIII. Appendix

I. Introduction & Background. We want to explore the possibility that representing protein 3D structures in coordinate systems other than Cartesian might lead to useful novel applications. We investigate here the use of spherical coordinates for globular proteins and of cylindrical coordinates for rod-shaped proteins (and viruses). Spherical coordinate representation involves the three coordinates ρ, φ and θ in which, when analogized with earth measurements, φ and θ would correspond to latitudes and longitudes (in angular units), while ρ would correspond to elevation, although not with respect to sea level but instead to the center of the earth (the "origin" of the system; see, for example, http://www.math.montana.edu/frankw/ccp/multiworld/multipleIVP/spherical/learn.htm [1]). φ goes from 0° to 180° while θ goes from 0° to 360°; ρ, on the other hand, is nonnegative.

We shall discuss the two applications of the spherical coordinate representation of proteins separately in each section that follows. The first application, that of separating the protein outer layer (OL) from the inner core (IC), will be called "OL-IC Separation", while the second application, that of the identification of protrusions and invaginations on the protein surface, will be termed "Surface Topography."

Application #1: OL -- IC Separation: Surface Properties vs. Buried Properties of Proteins. It is widely established that proteins fold in such a way that hydrophilic residues are exposed on the surface while hydrophobic ones are buried in the interior, although with some exceptions, such as integral membrane proteins. Thus in general we expect the protein surface to have different properties from the protein interior. Surface features of proteins include shallow ligand binding sites and active sites (although some are deeply buried), protein-protein interaction sites, post-translational modification sites, and epitopes, to name a few.
Thus studies of the OL separate from the IC, and vice versa, might find use in drug design, protein inhibition, computational epitope mapping and vaccine design [2], [3], [4]. As a proof of concept, we shall describe here a method for prediction of candidate epitopes using the OL-IC separation method. It involves clustering of points/atomic coordinates in the OL, then screening for those which contain points/atomic coordinates above a certain density as potential epitopes. We shall also briefly address the hypothesis that, since they are exceptions to the rule, any hydrophobic residues found within the OL might have biological roles. Buried features of proteins, on the other hand, include deep ligand binding sites and active sites (although some are shallow or close to the surface), prosthetic group binding sites, deep metal ion binding sites, and other features largely hitherto unidentified that contribute to the overall stability and integrity of the folded structure. The protein IC is largely hydrophobic as the aqueous environment of the cell prefers to interact with the hydrophilic residues which orient themselves on the surface to attain the most energetically stable overall configuration of the protein molecule. Our hypothesis regarding the IC is that occurrence of hydrophilic residues there must have some functional significance to overcome the energy constraint. We thus screened the IC of our test proteins [5] for hydrophilic residues and demonstrate potential applicability of this method for predicting buried functional sites in proteins. Application #2: Protein Surface Topography. The second application involves characterization of the exterior topography of proteins by the detection of invaginations and protrusions on the surface. 
This application is different from the first in that it does not involve separation of an outer layer from an inner core, but merely an investigation of the "surface positions" in a protein molecule: those positions which are farthest away from the protein centroid. If the protein structure is in spherical coordinate representation with its centroid as origin, as it is in our algorithm, these are the points with maximum ρ in each φ-θ bin. These positions with ρ maxima may be thought of as objects on the surface of the earth, if the earth were the protein. In this application, we analyze the frequency distribution (FD) of such ρ maxima and, as we demonstrate in the next section, features emerge in the FD plot that indicate the presence of protrusions and invaginations on the protein surface. This application may be of practical importance because protrusions and invaginations on the protein surface commonly have biological significance. For example, clefts may represent ligand binding sites, and protrusions may represent loops or small lobes that open and close onto a binding pocket.

B. Specific Objectives and Statement of Work. The specific objectives of this proposal are the continuation and further development of the work we have started on representing protein 3D structures in spherical and cylindrical coordinates. They are:

1. To create a web server to transform all globular proteins in the PDB to spherical coordinates and create a web site and database for the transformed structures.

2. To build a database and web server that can separate the outer layer (OL) from the inner core (IC) of an input protein 3D structure PDB file.
3. To continue to develop and refine our computational epitope mapping and protein surface topographical analysis procedures based on spherical coordinate representation.
4. To create a web server to transform all rod-shaped proteins and viruses in PDB Cartesian coordinates to cylindrical coordinates and create a web site and database for the transformed structures.
5. To find useful novel applications of the cylindrical coordinate transformed rod-shaped protein and virus structures.

C. General Plan and Design of Work. The work on all five specific objectives will be done in parallel since they have minimal or no dependencies. The great majority of calculation programs will be implemented in Fortran 77 or 90, embedded in UNIX C-shell or Perl scripts. Thus far, the author has written all of the Fortran programs, and this will likely continue until a proficient Fortran programmer joins the group. Flow diagrams that incorporate several Fortran programs that perform unit steps in an entire algorithmic procedure will be used extensively, as has been done in the past. To test the procedure, the scripts are first run on small datasets (10-12 elements), the results of which are then manually confirmed individually. If this step is successful, medium-sized datasets (50-60 elements) are used, and about a quarter of the results are randomly selected and checked manually. If this step also succeeds, the full dataset (usually a subset of, or the entire, PDB dataset) is used. While these steps are proceeding, the programs and scripts are constantly tested, refined and optimized.

D. Relationship of Work to Long-term goals of PI.
This work fits well within the overall research program of the author, which is in the area of computational structural biology, and which revolves around novel methods of 3D representation and 3D structural analysis of proteins and other biomolecular structures (DNA, RNA, carbohydrates and lipids). This research area is also an area of strength of the author since he has had several years of training in x-ray crystallography besides wet-bench molecular biology and biochemistry, as well as mathematics and computational science.

E. Relationship of Work to State of Knowledge in the Field. The literature on representing proteins in coordinate systems other than Cartesian is very sparse or almost nonexistent, so a detailed literature review on the topic is not currently possible. In the broader subject of novel protein 3D structure representations, there are the works by Barlow & Richards (1995) and Li et al. (2003), but neither deals with a change of coordinate systems, especially for the purpose of finding novel ways, not possible with Cartesian systems, to analyze protein structures for features of biological or chemical interest.

II. Experimental Methods and Procedures. Both applications require that we partition, or bin, both φ and θ in order to create φ-θ bins. This binning process can be done coarsely or finely: we define "coarse binning" as partitioning both φ and θ into 10° intervals, while we define "fine binning" as partitioning φ into 6° intervals and θ into 8° intervals. Coarse binning results in 18 x 36 φ-θ bins, while fine binning results in 30 x 45 φ-θ bins. We typically use fine binning for proteins in regular all-atom representation (AAR), while we typically use coarse binning for proteins in a reduced representation called 'double-centroid reduced representation' (DCRR; see references and below).
In AAR, each atom of the protein has a 3D coordinate; in the DCRR, there are only two coordinates per amino acid residue: that of the centroid of the backbone atoms, and that of the centroid of the side chain atoms. Typically the DCRR has about 76% fewer coordinate points than the AAR for the same protein structure. The two applications diverge after the φ-θ binning step. All constructions and calculations were accomplished by writing and executing Fortran 77 or 90 programs in a UNIX environment.

A. Application #1: OL -- IC Separation. With a computational tool to virtually separate the protein OL from its IC, several novel protein structure analytical investigations become more tractable. For instance, the OL can be searched for potential epitopes or protein-protein interaction sites, while the IC can be screened for deep ligand binding sites (LBS) or catalytic sites or otherwise analyzed for its structural role.

1. Conversion of AAR to DCRR. In a regular PDB file, the protein structure is represented in AAR; we first convert it to the "double-centroid reduced representation" (DCRR). We have set up a web server for converting proteins in AAR to DCRR; the URL is http://tortellini.bioinformatics.rit.edu/vns4483/dcrr.php [6], [7]. In this reduced representation, each amino acid is represented as two data points: the centroid of the backbone atoms (N, Cα, C and O), and the centroid of the side-chain atoms (Cβ and beyond). The centroid of the backbone is calculated simply by finding the average position of the backbone atoms; similarly the centroid of the side-chain is calculated by finding the average position of the side-chain atoms. One advantage of using DCRR instead of AAR for the protein structure is in reducing the noise, or false positives, that may come up during our analysis.

2. Conversion of Cartesian to Spherical Coordinates. The algorithm takes a PDB file, which contains the Cartesian coordinates of the protein, as first input. The second input is the protein molecular centroid, which is the average of the x, y, and z coordinates of the (non-hydrogen) atoms in the protein. The entire protein molecule is then translated so that its centroid is at the origin, (0,0,0). Then the protein Cartesian coordinates are converted to spherical coordinates (ρ, φ, θ) using standard equations implemented in a Fortran 90 program written for the purpose (unpublished).

3. Binning the spherical representation of the protein. The spherical representation of the protein is partitioned, or binned, in two different modes. In the 'fine binning' mode, the protein is partitioned into 6°-intervals in φ and 8°-intervals in θ, while in the 'coarse binning' mode, it is partitioned into 10°-intervals in φ and 10°-intervals in θ. Since φ goes from 0° to 180° while θ goes from 0° to 360°, the fine binning mode yields 1350 φ-θ bins and the coarse binning mode yields 648 φ-θ bins. The φ-θ binning process is illustrated in Figure 1, panels A and B.

4. Separation of the inner core (IC) and the outer layer (OL). The majority of proteins, especially the globular type, have an inner core (IC) analogous to a 'medulla', and an outer layer (OL) analogous to a 'cortex'.
In separating the OL from the IC, the maximum ρ value in each φ-θ bin is first determined. Then for each φ-θ bin, the protein coordinates with ρ values less than an empirically determined cutoff ρ value (typically […]% or […]% of the maximum value in the particular φ-θ bin) are assigned to the IC, while those with ρ values equal to or greater than the cutoff ρ value are assigned to the OL. Note that if the protein is in AAR, these points are individual atomic coordinates; if the protein is in DCRR, these points are backbone or side chain centroids. This separation process using ρ cutoff values is illustrated in Figure 2. With the protein in DCRR and its OL separated from its IC, the amino acid residues naturally fall into four classes, which we term as follows (see Figure 3): (a.) "OL residues", in which case both backbone and side chain centroids are located within the OL; (b.) "IC residues", in which case both backbone and side chain centroids are located within the IC; (c.) "bir" or "boundary inward residues", where the backbone centroid is in the OL while the side chain centroid is in the IC; and (d.) "bor" or "boundary outward residues", where the backbone centroid is in the IC while the side chain centroid is in the OL. However, a much simpler classification is possible if only the side chain centroids are considered, as these points will lie either in the IC or in the OL but not on the boundary (cases (a.) and (b.)).

Figure 1 A
Figure 1 B
Figure 2
Figure 3

5. Using the OL to predict candidate epitopes. Our epitope prediction algorithm assumes that potential epitopes are clusters of points (atoms or centroids) on the OL. We thus assign a point density for each φ-θ bin on the OL; this number is equal to the number of points (atoms or centroids) in the φ-θ bin divided by the area of the φ-θ bin. Whereas the number of points in a φ-θ bin is readily obtained from our algorithm programs, the areas of the φ-θ bins vary with respect to their location on the surface, with those close to the equator (φ = 90°) significantly larger than those lying close to the poles (φ = 0° and 180°). They also vary with respect to their distances from the protein molecular centroid, with those farther from the molecular centroid (large ρ's) larger than those closer to the molecular centroid (small ρ's). We thus require a formula for the area of a φ-θ bin that captures these two dependencies (Figure 4A). Such a formula is:

$$A_{\varphi\theta b} = \rho^{2}\,\lvert\cos\varphi_1 - \cos\varphi_2\rvert\,\lvert\theta_1 - \theta_2\rvert$$

where A_φθb is the area of a φ-θ bin that is bounded by φ1 and φ2 and by θ1 and θ2, and ρ is the maximum ρ in that φ-θ bin. The derivation of this formula is straightforward and is left to the reader. The point density D for each φ-θ bin is thus equal to B/A_φθb, where B is the number of coordinates in the φ-θ bin. For fine binning the factor |θ1-θ2| is 8°, while for coarse binning it is 10°. Hence A_φθb depends only on φ and ρ for a given binning mode. A typical plot of point density (z-axis) as a function of φ and θ (x-y plane) is illustrated in Figure 4B. Peaks in this plot represent clusters of points on the OL of the protein.

Figure 4 A
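The steps of sections 2 to 5 can be sketched end to end as follows. This is a minimal illustration in Python, not the author's Fortran programs. It assumes φ is the polar angle measured from the +z axis (0° to 180°) and θ is the azimuth in the x-y plane (0° to 360°), and it uses the standard spherical-patch area ρ²|cos φ1 − cos φ2||θ1 − θ2| with θ converted to radians. All function names are hypothetical, and `cutoff_fraction` is a placeholder parameter that the caller must supply, since no specific cutoff value is assumed here.

```python
import math
from collections import defaultdict

def to_spherical(points):
    """Centre points on their centroid; return (rho, phi, theta), angles in degrees."""
    n = len(points)
    cx, cy, cz = (sum(p[k] for p in points) / n for k in range(3))
    out = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        rho = math.sqrt(dx * dx + dy * dy + dz * dz)
        # Assumed convention: phi is the polar angle from +z, theta the azimuth.
        phi = math.degrees(math.acos(max(-1.0, min(1.0, dz / rho)))) if rho > 0 else 0.0
        theta = math.degrees(math.atan2(dy, dx)) % 360.0
        out.append((rho, phi, theta))
    return out

def bin_index(phi, theta, dphi, dtheta):
    """Map angles to a (phi, theta) bin; boundary values 180 and 360 fall in the last bin."""
    n_phi, n_theta = round(180 / dphi), round(360 / dtheta)
    return min(int(phi // dphi), n_phi - 1), min(int(theta // dtheta), n_theta - 1)

def split_ol_ic(spherical, dphi, dtheta, cutoff_fraction):
    """Return (ol, ic, rho_max): point indices per bin and each bin's maximum rho."""
    bins = defaultdict(list)
    for idx, (rho, phi, theta) in enumerate(spherical):
        bins[bin_index(phi, theta, dphi, dtheta)].append(idx)
    ol, ic, rho_max = defaultdict(list), defaultdict(list), {}
    for key, members in bins.items():
        rho_max[key] = max(spherical[i][0] for i in members)
        cutoff = cutoff_fraction * rho_max[key]  # hypothetical, caller-supplied fraction
        for i in members:
            (ol if spherical[i][0] >= cutoff else ic)[key].append(i)
    return ol, ic, rho_max

def bin_area(rho, i_phi, dphi, dtheta):
    """Spherical-patch area rho^2 * |cos(phi1) - cos(phi2)| * |theta1 - theta2| (radians)."""
    phi1, phi2 = math.radians(i_phi * dphi), math.radians((i_phi + 1) * dphi)
    return rho ** 2 * abs(math.cos(phi1) - math.cos(phi2)) * math.radians(dtheta)

def point_density(ol, rho_max, dphi, dtheta):
    """Point density D = B / A for every bin that holds OL points and has rho_max > 0."""
    return {key: len(members) / bin_area(rho_max[key], key[0], dphi, dtheta)
            for key, members in ol.items() if rho_max[key] > 0}
```

With coarse binning one would call, for example, `split_ol_ic(to_spherical(atoms), 10.0, 10.0, cutoff_fraction)` and then `point_density(ol, rho_max, 10.0, 10.0)`; bins whose density exceeds the chosen minimum would be the candidate clusters.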

The final step in the algorithm is determining candidate epitopes from the point density plots. A cutoff value is determined using a distribution curve that is generated by running the density calculation against a comprehensive set of PDB files. From this distribution curve, values in the top 10% represent potential epitopes. Currently we use the values 8.4 for the coarse binning mode, and 6.6 for the fine binning mode, as minimum point densities in a φ-θ bin for a point cluster to be classified as a candidate epitope.

Figure 4 B

6. Using the IC to find possible protein functional sites. After the protein has been separated into the OL and the IC, the IC can be analyzed to find deeply buried active or functional sites. Typically proteins fold in a way such that hydrophilic residues are located in the OL while hydrophobic residues are located within the IC of the protein. The IC is thus scanned (using Perl and/or UNIX scripts) for the presence of hydrophilic residues which could be potential buried active or functional sites of the protein.

B. Application #2: Surface Topography

1. An Artificial Protein. In order to illustrate the method in an ideal system, we constructed an "artificial protein" in the form of an equally-spaced grid of points inside the scalene ellipsoid with equation:

$$\left(\frac{x}{22}\right)^{2} + \left(\frac{y}{26}\right)^{2} + \left(\frac{z}{30}\right)^{2} = 1.0$$

The distance between neighboring points is 1.5 units along the x, y and z directions to mimic molecular bond lengths in Å. The three principal axes of this scalene ellipsoid are of non-identical lengths, namely: X_A X_B where X_A = (22,0,0) and X_B = (-22,0,0) along the x-axis, Y_A Y_B where Y_A = (0,26,0) and Y_B = (0,-26,0) along the y-axis, and Z_A Z_B where Z_A = (0,0,30) and Z_B = (0,0,-30) along the z-axis. Thus the shortest and longest dimensions of the ellipsoid are along the x- and the z-axes, respectively. Three variants of the artificial protein were created (Figure 5 A, B). The first variant was one with an invagination along the shortest dimension of the ellipsoid, at X_A = (22,0,0). The second variant was one with a protrusion along the longest dimension of the ellipsoid, at Z_A = (0,0,30). Finally, the third variant was one with both an invagination along the shortest dimension of the ellipsoid, at X_A = (22,0,0), and a protrusion along the longest dimension of the ellipsoid, at Z_A = (0,0,30). The invagination at X_A = (22,0,0) was created by "scooping out" a hemisphere of points: those inside the sphere with center at X_A = (22,0,0) and radius 4.0. The protrusion at Z_A = (0,0,30) was created by translating by 2.0 units those points in the ellipsoid lying inside the cylinder of radius 2.0 whose axis is the z-axis, and replacing the points vacated on the other end of the ellipsoid at Z_B = (0,0,-30) so that there would not be any invagination there.
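The construction described above can be sketched in Python as follows. This is an illustrative reading of the text, not the original Fortran code: the function names are hypothetical, the grid is assumed to be centered on the origin, and the "replacing the points vacated" step is interpreted as re-adding the original column points within 2.0 units of the bottom of the ellipsoid.

```python
import math

A, B, C = 22.0, 26.0, 30.0   # semi-axes from the ellipsoid equation
STEP = 1.5                   # grid spacing in all three directions

def intact_ellipsoid():
    """Grid points (spacing STEP) lying inside or on the scalene ellipsoid."""
    n = int(C // STEP)       # 20 steps reach the longest semi-axis
    pts = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                x, y, z = i * STEP, j * STEP, k * STEP
                if (x / A) ** 2 + (y / B) ** 2 + (z / C) ** 2 <= 1.0:
                    pts.append((x, y, z))
    return pts

def with_invagination(pts, center=(22.0, 0.0, 0.0), radius=4.0):
    """Scoop out every point within `radius` of the surface point X_A."""
    return [p for p in pts if math.dist(p, center) > radius]

def with_protrusion(pts, radius=2.0, shift=2.0):
    """Shift the column around the z-axis upward and refill the vacated bottom."""
    column = [p for p in pts if math.hypot(p[0], p[1]) <= radius]
    rest = [p for p in pts if math.hypot(p[0], p[1]) > radius]
    moved = [(x, y, z + shift) for x, y, z in column]
    # Assumed interpretation: keep the original points in the bottom `shift` units.
    refill = [p for p in column if p[2] < -C + shift]
    return rest + moved + refill
```

The third variant would then be `with_protrusion(with_invagination(intact_ellipsoid()))`, since the two modifications act on disjoint regions of the ellipsoid.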


Figure 5 B

Figure 5 A

2. Frequency Distribution of Maximum Rhos. The set of maximum rho values collected in the last section are arranged from lowest to highest, partitioned, or binned, in 1.50 Å intervals, and the number of maximum rho values in each interval is counted. These are then plotted as a frequency distribution, with the rho intervals, or bins, on the x-axis and the frequency in each bin along the y-axis. We call these graphs 'FDMR plots', for 'frequency distribution of maximum ρ's'. This step was applied to both the artificial protein and the real test proteins.

III. Preliminary Results.

[Figure 6 A. PROTEIN OUTER LAYER (all-atom representation, fine binning): bars of %hydrophobic and %hydrophilic (y-axis, 0 to 0.9) for the 67 test structures (x-axis, two-letter labels)]

A. Application #1: OL -- IC Separation

1. Amino Acid Compositions of OL and IC. In the following discussion, we classified the amino acid residues based on the location of their side chains, that is, whether they are located in the OL or the IC; the backbone positions were not considered. Thus 'boundary outward' (as described earlier) amino acids were considered to be in the OL, while 'boundary inward' residues were considered to be in the IC.

[Figure 6 B. PROTEIN INNER CORE (all-atom representation, fine binning): y-axis 0 to 1.2; x-axis: the 67 test structures (two-letter labels)]
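The residue classification used here can be sketched as a small Python helper. It assumes each residue's backbone and side-chain centroids have already been assigned to the OL or the IC; the function names are hypothetical and the class labels follow the terms defined earlier in the text.

```python
def classify_residue(backbone_in_ol, sidechain_in_ol):
    """Four-class label from the layer of each DCRR centroid (True = OL, False = IC)."""
    if backbone_in_ol and sidechain_in_ol:
        return "OL"
    if not backbone_in_ol and not sidechain_in_ol:
        return "IC"
    # Mixed cases: the side chain decides between outward (bor) and inward (bir).
    return "bor" if sidechain_in_ol else "bir"

def layer_by_side_chain(label):
    """Collapse the four classes to OL/IC using the side-chain position only."""
    return "OL" if label in ("OL", "bor") else "IC"
```

Under this simplification a 'boundary outward' residue counts toward the OL composition and a 'boundary inward' residue toward the IC composition, as stated above.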

We used the 67 proteins in the dataset of Laskowski et al. (1996) in the following analyses; each protein corresponds to a pair of bars in the bar graphs in Figures 6 A-D (for brevity we use 2-letter abbreviations, upper case followed by lower case), arranged in the same order as in Table 1 of the said paper. Note that for proteins in AAR, fine binning, the OLs of the structures all have significantly higher percentages of hydrophilic amino acids than hydrophobic amino acids (Figure 6A), except for structure Cc (PDB ID: 2POR), whose OL has slightly more hydrophobic residues. In structures Ap (2YHX), Bm (1HNE), Ch (1MNS) and Cm (2CND), the OL is predominantly hydrophilic as expected, but to an extent that seems to be less than average for the group. On the other hand, the ICs of all the 67 test structures have considerably higher percentages of hydrophobic amino acid residues than hydrophilic residues (Figure 6B); in one case, that of Ck (2ABK, a DNA endonuclease III [9]), the inner core is 100% hydrophobic. In proteins Ay (1ROB) and Cc (2POR), the IC is predominantly hydrophobic as expected, but to an extent that seems to be less than average for the group. We also performed coarse binning for proteins in AAR, and the results are very similar to the above (see Appendix, Figures S1a and S1b).

[Figure 6 C. PROTEIN OUTER LAYER (reduced representation, coarse binning): bars of %hydrophobic and %hydrophilic (y-axis, 0 to 0.9) for the 67 test structures (x-axis, two-letter labels)]

For proteins in DCRR, coarse binning, the OLs of the structures all have significantly higher percentages of hydrophilic amino acid residues than hydrophobic residues (Figure 6 C), except for structures Cc (2POR) and Cm (2CND), which have a higher percentage of hydrophobic residues. In structures As (2CUT), Bk (2ALP) and Bm (1HNE) the OL is predominantly hydrophilic as expected, but to an extent that seems to be less than the average for the group. On the other hand, the ICs of all the structures have higher percentages of hydrophobic amino acid residues than hydrophilic ones (Figure 6 D), except in Aw (1ONC) and Ay (1ROB), which are predominantly hydrophilic. In Av (1RNH), Az (1SNC) and Ci (3PGM), the inner core is still predominantly hydrophobic but the difference seems to be less than the average for the group.

[Figure 6 D. PROTEIN INNER CORE (reduced representation, coarse binning): bars of %hydrophobic and %hydrophilic (y-axis, 0 to 0.8) for the 67 test structures (x-axis, two-letter labels)]

In summary, the major exception cases for the OL (i.e., higher percentage of hydrophobic than hydrophilic amino acid residues) are 2POR and 2CND, while the major exception cases for the IC (i.e., higher percentage of hydrophilic than hydrophobic amino acids) are 1ONC and 1ROB. Protein 2POR is porin [10], while 2CND is nitrate reductase from corn (Zea mays) [11]; both are integral membrane proteins [12] that must have hydrophobic OLs. On the other hand, protein 1ONC is P-30 [13], an amphibian ribonuclease, while 1ROB is bovine ribonuclease A [14].
Type A ribonucleases such as 1ONC and 1ROB are known to be composed of two flaps or flattened lobes, at the interface of which lie positively charged residues which together bind the negatively charged RNA substrate, thus explaining their hydrophilic ICs. Taken together, the above results demonstrate that our spherical coordinate system-dependent OL-IC separation algorithm does its job with reasonable accuracy.

2. Finding Candidate Epitopes. We tested our epitope prediction method using the dataset of 67 proteins of Laskowski [5] using both fine binning and coarse binning modes on the proteins represented in either AAR or DCRR. The coarse binning mode consistently predicted more candidate epitopes than the fine binning mode [data not shown; 15]. This illustrates the ability of the algorithm to produce either more sensitive or more selective results depending on which binning method is chosen. When these proteins were cross-referenced against the Immune Epitope Database (IEDB, [16]), it was found that all of these proteins contained true epitopes and were antigenic in some capacity. Discrepancies between the number of predicted epitopes and true epitopes might be due to several factors, including: overly sensitive candidate epitope cutoff criteria, the incomplete and ever-changing nature of the IEDB and PDB databases, and protein-protein interaction sites that appear to be candidate epitopes according to the algorithm. The procedure and the parameters are currently being refined.

3. Results for prediction of buried active sites. The analysis of finding possible deeply buried active sites was run against the same set of 67 proteins [5] referred to above and compared to the set of catalytic sites curated in the Catalytic Site Atlas (CSA, [17]). The number of predicted deep active sites in the IC for each protein represented in either AAR or DCRR using the coarse and the fine binning modes was tabulated [data not shown; 18]. The coarse binning mode produced a larger number of predicted active sites than the fine binning mode for each protein analyzed. Again, this illustrates the ability of the algorithm to produce either more sensitive or more selective results depending on which binning method is chosen. Again, the procedure and the parameters are currently being refined.

4. Refinement Tests for Epitope Prediction: Fine vs. Coarse Binning and AAR vs. DCRR. The fine and coarse binning methods (fb and cb, respectively) were compared with each other with the protein in AAR as well as in DCRR; we designate these combinations as AAR/fb, AAR/cb, DCRR/fb and DCRR/cb.
The algorithm outputted a reduced number of candidate epitopes with AAR/fb and AAR/cb compared to the other two since the φ-θ bin size is too small, and thus atoms in amino acid residues are being split between neighboring bins, or amino acid residues themselves are being precluded from becoming clustered in a single bin, thus preventing the algorithm from "seeing" (i.e., detecting) them. This complication is further compounded when atoms in an amino acid residue are split between the OL and the IC of a φ-θ bin. We found that using proteins in DCRR greatly reduced the above ambiguities in the algorithm output (data not shown). When proteins are in DCRR, the amino acid residues may be classified as either in the OL or IC depending on where their side chain centroids lie, which is almost never exactly on the OL-IC boundary. But when the backbone centroids are taken into account, the amino acid residues may be of four types: (a.) both backbone and side chain centroids are located within the OL; (b.) both backbone and side chain centroids are located within the IC; (c.) the backbone centroid is in the OL while the side chain centroid is in the IC; and (d.) the backbone centroid is in the IC while the side chain centroid is in the OL (see Figure 3). We term the first case "OL residues", the second case "IC residues", the third case "boundary inward residues" (bir's) and the fourth case "boundary outward residues" (bor's). To sum up, we conclude that the best combination method is DCRR/cb, although use of the other three combinations in some special cases (e.g., small proteins or structured peptides) is not ruled out.

B. Application #2: Surface Topography

1. Use of Artificial Protein to Test Method of Identifying Invaginations and Protrusions on Protein Surface. Before applying our procedure to the 67 proteins in our test set, we applied it to the artificial protein we have created and its three variants (see Figure 5). The resulting plots are shown in Figure 7 A-D.
The plot for the intact ellipsoid in panel A shows a lone main peak without any shoulders or subpeaks anywhere in the plot. The plot for the first variant, with the invagination, in panel B clearly shows a shoulder or subpeak on the lagging side of the main peak. A shoulder or subpeak on the lagging side of the main peak in the frequency distribution of maximum ρ values is therefore diagnostic of an invagination, especially if this invagination is along the shortest dimension of the ellipsoid. If the invagination does not lie along the shortest dimension, it may be masked in the plot by some points along the shortest dimension, and a subpeak on the lagging side of the main peak will hardly be visible. In such a case, the invagination will manifest itself as a non-coincidence between the superimposed FDMR plots of the intact ellipsoid and that of the invagination-containing variant elsewhere along the plot.

[Figure 7 A. scalene ellipsoid, intact: frequency (y-axis, 0 to 300) vs. rho intervals (x-axis, 1.5 Å each, 1 to 35)]

[Figure 7 B. scalene ellipsoid w/ invagination: frequency (y-axis, 0 to 300) vs. rho intervals (x-axis, 1.5 Å each, 1 to 35)]

The plot for the second variant, with the protrusion, in panel C, on the other hand, clearly shows a shoulder or a subpeak on the leading side of the main peak. A shoulder or subpeak on the leading side of the main peak in the frequency distribution of maximum ρ values is therefore diagnostic of a protrusion, especially if the protrusion is along the longest dimension of the ellipsoid. If the protrusion does not lie along the longest dimension, it may be masked in the plot by some points along the longest dimension, and a subpeak on the leading side of the main peak will hardly be visible. In that case, the protrusion will manifest itself as a non-coincidence between the superimposed FDMR plots of the intact ellipsoid and that of the protrusion-containing variant elsewhere along the plot.

[Figure 7 C, panel title: scalene ellipsoid w/ protrusion]

Finally, the plot for the third variant with both the invagination and protrusion in panel D shows a shoulder or a subpeak on both the lagging and leading sides of the main peak. As mentioned in the previous two paragraphs, the invagination and protrusion manifest themselves as subpeaks or shoulders on both the lagging and leading sides of the main peak, respectively, especially if they are along the shortest and longest dimension of the ellipsoid, respectively.
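The FDMR plots discussed in this section can be sketched in Python as below. The sketch assumes the points are already in centroid-centered spherical coordinates (ρ, φ, θ) with angles in degrees; the function names are hypothetical, and the 10° defaults correspond to the coarse binning mode. In this simplified version a point at exactly φ = 180° or θ = 360° would form its own bin.

```python
from collections import Counter

def max_rho_per_bin(spherical, dphi=10.0, dtheta=10.0):
    """Largest rho found in each occupied phi-theta bin."""
    best = {}
    for rho, phi, theta in spherical:
        key = (int(phi // dphi), int(theta // dtheta))
        if rho > best.get(key, -1.0):
            best[key] = rho
    return list(best.values())

def fdmr(max_rhos, width=1.5):
    """Counts of maximum-rho values per interval of `width` (interval 0 starts at rho = 0)."""
    counts = Counter(int(r // width) for r in max_rhos)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(b, 0) for b in range(n_bins)]
```

Plotting the list returned by `fdmr` against its index gives the frequency distribution; a shoulder on either side of the main peak would then show up as a secondary run of elevated counts.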

[Figure 7 C axes: frequency (y-axis, 0 to 300) vs. rho intervals (x-axis, 1.5 Å each, 1 to 35)]

Taken together, from the above results

it may be

concluded that, in the general case where invaginations and protrusions lie on random locations on the ellipsoid surface, the superimposed FDMR plots of the smooth ellipsoid and those of its protrusion- or invagination-containing variants will display segments of non-coincidences on specific corresponding parts of the plots. Figure 7 C
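As a rough illustration of how segments of non-coincidence between two superimposed FDMR plots might be located, the sketch below compares two frequency lists interval by interval. It is not the procedure used in this project: the function name `noncoincident_segments`, the tolerance parameter `tol`, and the toy counts are all hypothetical.

```python
from typing import List, Tuple


def noncoincident_segments(ref: List[int], var: List[int],
                           tol: int = 0) -> List[Tuple[int, int]]:
    """Return (first, last) 1-based rho-interval numbers of every run in which
    the two FDMR histograms differ by more than `tol` counts."""
    if len(ref) != len(var):
        raise ValueError("both FDMRs must be tabulated on the same rho intervals")
    segments: List[Tuple[int, int]] = []
    start = None  # first interval of the run currently being tracked
    for i, (a, b) in enumerate(zip(ref, var), start=1):
        differs = abs(a - b) > tol
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:  # a run that extends to the last interval
        segments.append((start, len(ref)))
    return segments


# Toy counts (hypothetical): the variant has extra counts in interval 2 only.
intact = [0, 5, 40, 120, 40, 5, 0, 0]
variant = [0, 25, 40, 120, 40, 5, 0, 0]
print(noncoincident_segments(intact, variant))
```

A nonzero `tol` would let small count differences be treated as coincidence; what threshold is appropriate for real structures is left open here.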

Figure 7 D. Scalene ellipsoid w/ invagination & protrusion (x-axis: rho intervals, 1.5 Å each; y-axis: frequency).

2. Application to Real Proteins in the Laskowski Data Set. Due to space limitations, we can only show four sets of data here, corresponding to four proteins in the Laskowski data set (refer to Table 1). In these results, the FDMR of the liganded protein is superimposed with that of its unliganded form (i.e., ligand/s deleted from the PDB file before algorithm implementation). Figure 8 A shows the superimposed FDMR plots for protein Cd (1CIL). A subpeak appears on the lagging side of the main peak of the unliganded relative to the liganded form. Figure 8 B shows the plots for protein Bc (1XNB). A subpeak appears on the leading side of the main peak of the unliganded relative to the liganded form. We interpret these results as local changes in the protein structure, giving rise to invaginations and protrusions, respectively, upon ligand binding. Structure 1CIL is that of human carbonic anhydrase II [19], while 1XNB is that of a bacillus xylanase with a bound sulfate ion [20].

Figure 8 A

Figure 8 C shows the plots for protein Co (4BCL). We notice a general, widespread non-coincidence between the two superimposed plots. For protein Bg (1BLL; data not shown), we observe a clean coincidence of the two superimposed plots. Structure 4BCL is that of the FMO antenna protein from green sulfur bacteria [21], while 1BLL is that of bovine lens leucine aminopeptidase [22]. We are continuing to analyze the precise structural meanings of these last two cases as we carefully refine our methods for spherical coordinate protein structure representation. A more comprehensive analysis and refinement of this application will be published in a future submission.

IV. Concluding Remarks.

A. Conclusions and Future Directions. We have utilized the spherical coordinate system as an alternative to the Cartesian system to represent protein 3D structures. Using this representation, we have developed a way to separate the protein outer layer (OL) from the protein inner core (IC). Being able to separate the OL from the IC allowed us to investigate surface properties (from the OL) and buried properties (from the IC) of the protein systematically and independently. For example, we were able to predict potential epitopes and protein-protein interaction sites from the OL, and deeply buried functional sites such as catalytic residues in the IC. We are convinced that our separation method works properly and is valid because, when applied to the test set of 67 protein structures of Laskowski et al. (1996) [5], we found that all but a few have OLs that are significantly enriched with hydrophilic amino acids and ICs that are significantly enriched with hydrophobic amino acids. To the best of our knowledge, this is the first time that spherical coordinate representation has been utilized in protein 3D structural analysis.

Future directions for this project will include the use of cylindrical coordinates to represent rod-shaped proteins and viruses, and its applications. Two web servers, namely the "ProtMedCor Web Server" and the "ProtSurfTop Web Server", are being set up to implement the OL-IC separation algorithm and the protrusion/invagination detection algorithm, respectively, for public access.

We have also been able to get the cylindrical coordinate transformation procedure off the ground (Specific Objective #4). The procedure is composed of four general steps, namely: (1) starting with a rod-shaped protein in Cartesian coordinates, use any available graphics software to visually select the extreme points of the molecule, Pa and Pb (i.e., the points at the two "tips", Figure 9A); (2) translate the coordinates such that extreme point Pb is the new origin (Figure 9B); (3) using vector algebra (i.e., using vector dot and cross products), find the angle θ between vector PbPa and the positive z-axis, then rotate the coordinates such that vector PbPa now coincides with the positive z-axis while Pb is still at the origin (for details, see Appendix, Note S1) (Figure 9C); and (4) perform a transformation of coordinates from Cartesian (x,y,z) to cylindrical (r,θ,z) (Figure 9D) using standard equations. The entire algorithm is now coded in Fortran 90 and has been tested exhaustively and found to work correctly; it is now ready for testing on a large dataset and for further applications.

Figure 8 B

Figure 8 C

Figure 9 A
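The four steps above can be sketched as follows. This is an illustrative Python version, not the Fortran 90 code mentioned in the text; the function names (`rotation_onto_z`, `to_cylindrical`), the handling of the degenerate case in which PbPa is already parallel or antiparallel to the z-axis, and the sample coordinates are assumptions made for the example. Step 1 (picking Pa and Pb in graphics software) is taken as already done, and the rotation uses the standard axis-angle matrix.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotation_onto_z(v: Vec) -> List[List[float]]:
    """Step 3: matrix that rotates v onto the positive z-axis."""
    z_axis = (0.0, 0.0, 1.0)
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("Pa and Pb must be distinct points")
    c = dot(v, z_axis) / length            # cos(theta)
    axis = cross(v, z_axis)                # perpendicular to v and the z-axis
    axis_len = math.sqrt(dot(axis, axis))
    s = axis_len / length                  # sin(theta), never negative here
    if axis_len < 1e-12 * length:
        # Assumed handling: v already lies along +z (identity) or along -z
        # (half-turn about the x-axis), where the cross product vanishes.
        if c > 0:
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        return [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
    ux, uy, uz = (a / axis_len for a in axis)
    t = 1.0 - c
    return [
        [c + ux * ux * t,      ux * uy * t - uz * s, ux * uz * t + uy * s],
        [uy * ux * t + uz * s, c + uy * uy * t,      uy * uz * t - ux * s],
        [uz * ux * t - uy * s, uz * uy * t + ux * s, c + uz * uz * t],
    ]


def to_cylindrical(points: Sequence[Vec], pa: Vec, pb: Vec) -> List[Vec]:
    """Steps 2-4: translate so Pb is the origin, rotate PbPa onto +z, then
    convert each point to (r, theta in radians, z)."""
    rot = rotation_onto_z(sub(pa, pb))
    result = []
    for p in points:
        x, y, z = sub(p, pb)                                    # step 2
        xr = rot[0][0] * x + rot[0][1] * y + rot[0][2] * z      # step 3
        yr = rot[1][0] * x + rot[1][1] * y + rot[1][2] * z
        zr = rot[2][0] * x + rot[2][1] * y + rot[2][2] * z
        result.append((math.hypot(xr, yr), math.atan2(yr, xr), zr))  # step 4
    return result


# Hypothetical tips: Pa should end up on the z-axis at a height equal to |PbPa|.
pb = (1.0, 1.0, 1.0)
pa = (1.0, 4.0, 5.0)
print(to_cylindrical([pa, pb, (2.0, 1.0, 1.0)], pa, pb))
```

After the transformation, Pb sits at the origin and Pa on the positive z-axis, so r measures the distance of each point from the Pb-Pa axis.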

B. Preservation/Documentation/Sharing of Data & Related Research/Education Products. Results of our work from this project will first be presented as posters in scientific conferences, and, after further development and refinement, will be written up as manuscripts for publication in peer-reviewed scientific journals. They will be made available to the community in the form of databases and web servers. Unpublished results and other details will be available upon request from the PI; those which are deemed patentable by our institute's intellectual property management office will be put up for patent and made available to academic or non-profit requestors upon formal agreement of confidentiality. This project is especially useful pedagogically since it is a new way to look at and analyze protein 3D structures.

V. Broader Impact. The author teaches courses in proteomics and bioinformatics and has always thought about how best to integrate basic research into classroom teaching. At RIT he has initiated a research-based teaching method in order to advance discovery while promoting teaching, training and learning. In this method, the students are given a "freestyle" laboratory exercise involving an open problem in the field. He will present the successful results of this innovative teaching technique in a video clip at ICERI 2010 (International Conference on Education, Research and Innovation). The author will continue to improve and refine his research-based pedagogical method as he continues to teach a growing variety of courses at the institute. As for broadening the participation of underrepresented groups, we note that our institute has been the home of the National Technical Institute for the Deaf (NTID) since 1967. As such, we have the unique opportunity to increase the participation of this disadvantaged group in our research programs. Our institute also attracts and serves many talented African- and Latino-American students, as well as foreign nationals; for example, we attract many talented students from Malaysia and India. As a means to enhance infrastructure for basic research and education, we propose to create databases and web servers that would store, visually display, and create files for protein 3D structures in spherical and cylindrical coordinates for globular and rod-shaped proteins and viruses, respectively; it is hoped that these two novel resources will be useful not only to researchers in the computational biology field but also to K-12 students and educators. The author will strongly encourage his graduate students and post-doc to give research talks in one of several seminar series in place within our institute as well as within the Rochester-Buffalo-Ithaca area. The author, along with his post-doc, technician and graduate students, plans to actively participate in national as well as international meetings in order to disseminate their work. These activities will be the prelude to the publication of their findings in primary, peer-reviewed scientific journals. Among the specific applications of this proposed work is in silico epitope mapping, which has significant implications for vaccine development. The datasets being used by the author and his students currently are proteins of human origin, and it is hoped that this human focus would benefit society in general. Finally, the author intends to make sure that his post-doc, technician and undergraduate and graduate students are all trained in the scientific method and way of thinking, so they can become the independent scientific researchers of the future.

Figure 9 B

Figure 9 C

VI. Management Plan.
In addition to regular e-mail communications, the author will hold 1- to 2-hour meetings every other week with his graduate students, post-doc and technician for research progress reports and other pertinent matters. In these meetings, everyone, especially the author, will brainstorm to find solutions to any problems anyone is having with his/her research, and to present new ideas for solving/approaching research problems at hand. These meetings will also be a forum where anyone can pose new scientific questions (those within the realms of the author's scientific program) for possible analysis and proposed solution by everyone in the group. Occasional lectures by the author dealing with certain special topics relevant to his research program will also be held during these meetings. These meetings will ultimately serve as a training ground for the students, post-doc and technician in the scientific method and way of thinking. Financial and administrative matters related to the grant will be handled in coordination with the institute's Office of Sponsored Research.

Figure 9 D

VII. Miscellaneous Items.


A. Keywords: protein outer layer; protein inner core; computational epitope mapping; spherical coordinate system; protein double-centroid reduced representation; protein functional site prediction; clustering algorithm

B. Abbreviations: IC, inner core (of folded protein); OL, outer layer (of folded protein); AAR, all-atom representation; DCRR, double-centroid reduced representation; fb, fine binning; cb, coarse binning; LBS, ligand binding site; bir, boundary-inward residue; bor, boundary-outward residue; FDMR, frequency distribution of maximum ρ's; φθb, phi-theta bin; Aφθb, surface area of a phi-theta bin

C. Definitions:

Outer Layer: When a linear protein chain folds in three dimensions, creating a globular molecule, amino acid residues that end up on the outer surface of the folded molecule are mostly of hydrophilic nature, interacting with the aqueous cytoplasmic environment; amino acid residues making up this (usually) exterior part of the folded protein we term here the "outer layer" or "OL".

Inner Core: When a linear protein chain folds in three dimensions, creating a globular molecule, amino acid residues that end up in the interior of the folded molecule are mostly of hydrophobic nature, avoiding the aqueous cytoplasmic environment; amino acid residues making up this (usually) interior part of the folded protein we term here the "inner core" or "IC".

Fine binning: a partitioning mode such that the individual partition intervals or "bins" are small in width or are narrow; in our procedure here, we partition a protein 3D structure in spherical coordinates with respect to both φ (latitudes) and θ (longitudes); in fine binning, φ is partitioned into […]°-intervals, while θ is partitioned into […]°-intervals.

Coarse binning: a partitioning mode such that the individual partition intervals or "bins" are large in width or are wide; in our procedure here, we partition a protein 3D structure in spherical coordinates with respect to both φ (latitudes) and θ (longitudes); in coarse binning, both φ and θ are partitioned into 10°-intervals.

Phi-Theta Bin: After binning (fine or coarse) a protein 3D structure with respect to φ and θ, the results are phi-theta bins; they are bounded on top and below (north and south) by adjacent φ values, and to the left and to the right (east and west) by adjacent θ values; a phi-theta bin is a 3D figure similar but not identical to a square pyramid, with its "base" being a sector of a sphere bounded as above, and its "vertex" the center of the sphere.

Boundary-Inward Residue: After applying our OL-IC separation procedure to a protein 3D structure in DCRR, some amino acid residues end up being "split" between the OL and the IC, i.e., they are considered to be on the "boundary" (of the OL and the IC): the side chain centroid may be in the IC and the backbone centroid may be in the OL, or vice versa; this is the former case; abbrev.: 'bir'.

Boundary-Outward Residue: same as above, but the latter case; abbrev.: 'bor'.

All-Atom Representation: The usual representation of protein 3D structures, as in the PDB, where each atom (usually the non-hydrogen ones) has its own coordinates, usually Cartesian, (x,y,z).

Double-Centroid Representation: A protein 3D reduced representation wherein each amino acid is represented by two centroids: that of the backbone atoms (N, Cα, C, O) and that of the side chain atoms (Cβ and beyond).

Frequency Distribution of Maximum Rho's: After performing phi-theta binning, the maximum rho in each bin are gathered, sorted, partitioned into intervals (each typically 1.50 Å in width), and then the number of rho's in each interval plotted; this is the resulting plot.
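The binning and FDMR definitions above can be illustrated with a short sketch. It is a minimal reading of those definitions, not the project's own code, and it rests on several assumptions: the function name `fdmr` is hypothetical; coordinates are taken as already expressed relative to the chosen origin; φ is measured from the positive z-axis (0° to 180°) and θ is the azimuth (0° to 360°); and the default bin width is the 10° coarse-binning value.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def fdmr(points: Iterable[Tuple[float, float, float]],
         bin_deg: float = 10.0, rho_width: float = 1.5) -> List[int]:
    """Count, per rho interval, how many phi-theta bins have their maximum rho
    in that interval. Index 0 of the result is the interval [0, rho_width)."""
    n_phi = math.ceil(180.0 / bin_deg)
    n_theta = math.ceil(360.0 / bin_deg)
    max_rho: Dict[Tuple[int, int], float] = {}
    for x, y, z in points:
        rho = math.sqrt(x * x + y * y + z * z)
        if rho == 0.0:
            continue  # the angles are undefined for a point at the origin
        phi = math.degrees(math.acos(max(-1.0, min(1.0, z / rho))))
        theta = math.degrees(math.atan2(y, x)) % 360.0
        # Clamp so that phi = 180 (or a rounded theta = 360) stays in the last bin.
        key = (min(int(phi // bin_deg), n_phi - 1),
               min(int(theta // bin_deg), n_theta - 1))
        max_rho[key] = max(max_rho.get(key, 0.0), rho)
    counts = Counter(int(r // rho_width) for r in max_rho.values())
    if not counts:
        return []
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# Toy points (hypothetical): two share a phi-theta bin, so only the larger rho counts.
pts = [(0.0, 0.0, 3.0), (0.0, 0.0, 1.0), (4.0, 0.0, 0.0), (0.0, -2.0, 0.0)]
print(fdmr(pts))
```

Empty phi-theta bins are simply skipped in this sketch; whether and how they should be counted is not specified in the definitions above.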

VIII. Appendix.

Figure S1a. Protein OL composition (% hydrophobic, cyan, and % hydrophilic, maroon), using AAR with coarse binning, showing similar results to AAR with fine binning, and DCRR with coarse binning, earlier. Plot title: PROTEIN OUTER LAYER (all-atom representation, coarse binning); x-axis: 67 test structures (Aa through Co).

Figure S1b. Protein IC composition (% hydrophobic, cyan, and % hydrophilic, maroon), using AAR with coarse binning, showing similar results to AAR with fine binning, and DCRR with coarse binning, earlier. Plot title: PROTEIN INNER CORE (all-atom representation, coarse binning); x-axis: 67 test structures (Aa through Co).

Note S1. In Figure 9B, to find the angle θ between vector PbPa and the positive z-axis, Oz+, divide the dot product of the vector PbPa and Oz+ by the product of their respective magnitudes; the arccosine of this number is the angle θ. To rotate the axes such that vector PbPa would coincide with the positive z-axis, Oz+, we need to find the rotation matrix corresponding to a rotation θ about an axis in the direction of a unit vector u such that u is perpendicular to the plane determined by vectors PbPa and Oz+; this vector is simply their cross product divided by its magnitude (a unit vector), and the rotation matrix R is

\[
R = \begin{pmatrix}
c + u_x^2 (1 - c) & u_x u_y (1 - c) - u_z s & u_x u_z (1 - c) + u_y s \\
u_y u_x (1 - c) + u_z s & c + u_y^2 (1 - c) & u_y u_z (1 - c) - u_x s \\
u_z u_x (1 - c) - u_y s & u_z u_y (1 - c) + u_x s & c + u_z^2 (1 - c)
\end{pmatrix}
\]

where c = cos θ and s = sin θ, and ux, uy and uz are the components of the unit vector u [24].
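A small worked instance of Note S1 may help. The vector PbPa = (0, 3, 4) is hypothetical, chosen only because the numbers come out evenly, and the matrix used is the standard axis-angle rotation matrix for a rotation θ about a unit vector u.

```latex
\begin{align*}
\overrightarrow{P_bP_a} &= (0,\,3,\,4), \qquad O_z^{+} = (0,\,0,\,1), \qquad
  \lVert \overrightarrow{P_bP_a} \rVert = 5 \\
\cos\theta &= \frac{\overrightarrow{P_bP_a} \cdot O_z^{+}}{5 \cdot 1} = \frac{4}{5},
  \qquad \sin\theta = \frac{3}{5}, \qquad \theta = \arccos(0.8) \approx 36.87^{\circ} \\
\mathbf{u} &= \frac{\overrightarrow{P_bP_a} \times O_z^{+}}
  {\lVert \overrightarrow{P_bP_a} \times O_z^{+} \rVert}
  = \frac{(3,\,0,\,0)}{3} = (1,\,0,\,0) \\
R &= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 4/5 & -3/5 \\ 0 & 3/5 & 4/5 \end{pmatrix},
  \qquad
R \begin{pmatrix} 0 \\ 3 \\ 4 \end{pmatrix}
  = \begin{pmatrix} 0 \\ 12/5 - 12/5 \\ 9/5 + 16/5 \end{pmatrix}
  = \begin{pmatrix} 0 \\ 0 \\ 5 \end{pmatrix}
\end{align*}
```

The rotated vector lies along the positive z-axis and keeps its length of 5, as expected of a pure rotation.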

REFERENCES

[1] http://www.math.montana.edu/frankw/ccp/multiworld/multipleIVP/spherical/learn.htm
[2] Gershoni JM, Roitburd-Berman A, Siman-Tov DD, Tarnovitski Freund N, Weiss Y (2007). Epitope mapping: the first step in developing epitope-based vaccines. BioDrugs 21(3):145-56.
[3] Tarnovitski N, Matthews LJ, Sui J, Gershoni JM, Marasco WA (2006). Mapping a neutralizing epitope on the SARS coronavirus spike protein: computational prediction based on affinity-selected peptides. J Mol Biol. 359(1):190-201.
[4] Mumey B, Angel T, Kirkpatrick B, Bailey B, Hargrave P, Jesaitis A, Dratz E (2003). Mapping Discontinuous Antibody Epitopes to Reveal Protein Structure and Changes in Structure Related to Function. IEEE Computer Society Bioinformatics Conference (CSB'03), 2003: 585.
[5] Laskowski RA, Luscombe NM, Swindells MB, Thornton JM (1996). "Protein Clefts in Recognition and Function." Prot. Sci., 5:2438-2452.
[6] Sheth VN. "Visualization of Protein 3D Structures in Reduced Representation with Simultaneous Display of Intra- and Intermolecular Interactions," M.S. Thesis, Oct. 2009, Dept. of Biology, Program in Bioinformatics, Rochester Institute of Technology, Rochester, NY 14623, U.S.A.
[7] Reyes VM & Sheth VN. "Visualization of Protein 3D Structures in 'Double-Centroid' Reduced Representation: Application to Ligand Binding Site Modeling and Screening," Chap. 23, in: Handbook of Research on Computational and Systems Biology: Interdisciplinary Applications, Springer, Shanghai China, 2010, L.A. Liu, D. Wei, & Y. Li, eds. (in press).
[8] http://www.ferris.edu/faculty/burtchr/sure452/notes/geometry_on_the_sphere.pdf
[9] http://math.rice.edu/~pcmi/sphere/
[9] Thayer MM, Ahern H, Xing D, Cunningham RP, Tainer JA. "Novel DNA binding motifs in the DNA repair enzyme endonuclease III crystal structure." EMBO J., vol. 14(16), pp. 4108-20, Aug. 1995.
[10] Weiss MS, Schulz GE. "Structure of porin refined at 1.8 A resolution." J Mol Biol., vol. 227(2), pp. 493-509, Sept. 1992.
[11] Lu G, Lindqvist Y, Schneider G, Dwivedi U, Campbell W. "Structural studies on corn nitrate reductase: refined structure of the cytochrome b reductase fragment at 2.5 A, its ADP complex and an active-site mutant and modeling of the cytochrome b domain." J Mol Biol., vol. 248(5), pp. 931-948, May 1995.
[12] http://www.worthington-biochem.com/nar/default.html
[13] Mosimann SC, Ardelt W, James MN. "Refined 1.7 A X-ray crystallographic structure of P-30 protein, an amphibian ribonuclease with anti-tumor activity." J Mol Biol., vol. 236(4), pp. 1141-53.
[14] Lisgarten JN, Gupta V, Maes D, Wyns L, Zegers I, Palmer RA, Dealwis CG, Aguilar CF, Hemmings AM. "Structure of the crystalline complex of cytidylic acid (2'-CMP) with ribonuclease at 1.6 A resolution. Conservation of solvent sites in RNase-A high-resolution structures." Acta Crystallogr D Biol Crystallogr., vol. 49(6), pp. 541-7, Nov. 1993.
[15] MacCreary M, 2010, unpublished results.
[16] Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B. The immune epitope database 2.0. Nucleic Acids Res. 2010 Jan;38(Database issue):D854-62.
[17] Porter CT, Bartlett GJ, Thornton JM. "The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data." (2004) Nucl. Acids Res. 32: D129-D133.
[18] Kim DJ, 2010, unpublished results.
[19] Smith GM, Alexander RS, Christianson DW, McKeever BM, Ponticello GS, Springer JP, Randall WC, Baldwin JJ, Habecker CN. "Positions of His-64 and a bound water in human carbonic anhydrase II upon binding three structurally related inhibitors." Protein Sci., vol. 3(1), pp. 118-25, Jan. 1994.
[20] Campbell RL, Rose DR, Wakarchuk WW, To RJ, Sung W, Yaguchi M. "High-Resolution Structures of Xylanases from B. Circulans and T. Harzianum Identify a New Folding Pattern and Implications for the Atomic Basis of the Catalysis," unpublished.
[21] Tronrud DE, Wen J, Gay L, Blankenship RE. "The structural basis for the difference in absorbance spectra for the FMO antenna protein from various green sulfur bacteria." Photosynth Res., vol. 100(2), pp. 79-87, May 2009.
[22] Kim H, Lipscomb WN. "X-ray crystallographic determination of the structure of bovine lens leucine aminopeptidase complexed with amastatin: formulation of a catalytic mechanism featuring a gem-diolate transition state." Biochemistry, vol. 32(33), pp. 8465-78, Aug. 1993.
[23] Reyes VM. "Representation of Protein 3D Structures in Spherical (ρ,φ,θ) Coordinates and Two of Its Potential Applications" (accepted for publication, iCCSB 2010).
[24] http://en.wikipedia.org/wiki/Rotation_matrix

VICENTE M. REYES, Ph.D.
Dept. Biological Sciences, Sch. Biol. & Med. Sciences
College of Science, Gosnell 08-1336
Rochester Institute of Technology
Rochester, NY 14623-5603

Tel: (585) 475-4115 Cell: (619) 212-9131 E-mail: [email protected] [email protected]

(a) Professional Preparation:

• Univ. of the Philippines, Diliman, Philippines (conc. in pure mathematics & operations research), Mathematics, B.S. (magna cum laude), 1980
• Univ. of the Philippines, Diliman, Philippines (conc. in organic chemistry & biochemistry), Chemistry, B.S. (magna cum laude), 1980
• California Institute of Technology, Pasadena, California, USA (conc. in molecular biology & biochemistry), Chemistry, Ph.D., 1988
• UCSD School of Extended Studies, Spec. Cert. in Bioinformatics (Spring 2002)
• UCSD School of Extended Studies, Prof. Cert. in Bioinformatics (Spring 2004)
• UCSD School of Extended Studies, Spec. Cert. in Data Mining (Winter 2007)

(b) Appointments:

• Assistant Professor, Dept. of Biological Sciences, SBMS, COS, R.I.T., 9/2008-present (computational biology/bioinformatics)
• IRACDA Postdoctoral Fellow & Assistant Project Scientist, UCSD Dept. of Pharmacology, SOM, 2004-'08 (computational biology/structural bioinformatics)
• Structural Bioinformatics Researcher, San Diego Supercomputer Center, 2002-'04 (structural bioinformatics)
• Bioinformatics studies, UCSD School of Extended Studies, La Jolla, CA, 2000-'02 (general bioinformatics)
• Senior Research Associate, The Scripps Research Institute, La Jolla, CA, 1995-'00 (protein x-ray crystallography; structure-based drug design)
• Postdoctoral Biochemist, Dept. of Chem. & Biochem., UCSD, La Jolla, CA, 1992-'95 (protein x-ray crystallography; structural enzymology)
• Postdoctoral Biologist, Dept's. of Biol. & Med., UCSD, La Jolla, CA, 1990-'92 (HIV/AIDS molecular biology)
• Postdoctoral Research Fellow, Lab. Tumor Cell Biol., NCI/NIH, Bethesda, MD, 1988-'89 (HIV/AIDS molecular biology)
• Graduate Student & Teaching Assistant, Dept. of Biol., CIT, Pasadena, CA, 1983-'88 (gene expression/molecular biology)
• Instructor in Mathematics, Dept. of Math., Univ. of the Phils., Diliman, Phils., 1980-'82 (differential and integral calculus I, II, and III; probability & statistics)

(c) Publications:

(i) Most closely related to proposed project:
• Reyes, V.M., "Representation of Protein 3D Structures in Spherical (ρ,φ,θ) Coordinates and Two of Its Potential Applications" (accepted for publication, iCCSB 2010).
• Reyes, V.M.* & Sheth, V.N., "Visualization of Protein 3D Structures in 'Double-Centroid' Reduced Representation: Application to Ligand Binding Site Modeling and Screening", Handbook of Research in Computational and Systems Biology: Interdisciplinary Approaches, IGI-Global/Springer (*corresponding author; in press).
• Reyes, V.M., "Modeling Protein-Protein Interface Interactions as a Means for Predicting Protein-Protein Interaction Partners." J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 873
• Reyes, V.M., "Pharmacophore Modeling Using a Reduced Protein Representation as a Tool for Structure-Based Protein Function Prediction", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 873
• Reyes, V.M., "Representing Protein 3D Structures in Spherical Coordinates – Two Applications: 1. Detection of Invaginations, Protrusions, and Potential Ligand Binding Sites; and 2. Separation of Protein Hydrophilic Outer Layer from the Hydrophobic Core", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, pp. 874-5

(ii) Other Significant Publications (after 1990):
• Reyes, V.M., "Pharmacophore Modeling Using a Reduced Protein Representation: Application to the Prediction of ATP, GTP, Sialic Acid, Retinoic Acid, and Heme-Bound and -Unbound Nitric Oxide Binding Proteins", J. Biomol. Struct. & Dyn., Book of Abstracts, Albany 2009: The 16th Conversation, June 16-20 2009, Vol. 26 (6) June 2009, p. 874
• Li, W., Byrnes, R.W., Hayes, J., Birnbaum, A., Reyes, V.M., Shahab, A., Mosley, C., Pekurovsky, D., Quinn, G.B., Shindyalov, I.N., Casanova, H., Ang, L., Berman, F., Arzberger, P.W., Miller, M., Bourne, P.E. "The Encyclopedia of Life Project: Grid Software and Deployment." New Gener. Comp. (2004) 22:127-136.
• Reyes, V.M., Greasley, S.E., Stura, E.A., Beardsley, G.P., Wilson, I.A. "Crystallization and preliminary crystallographic investigations of avian 5-aminoimidazole-4-carboxamide ribonucleotide transformylase-inosine monophosphate cyclohydrolase expressed in Escherichia coli." Acta Crystallogr D Biol Crystallogr. (2000) Aug;56 (Pt 8):1051-4.
• Lee, H., *Reyes, V.M., Kraut, J. "Crystal structures of Escherichia coli dihydrofolate reductase complexed with 5-formyltetrahydrofolate (folinic acid) in two space groups: evidence for enolization of pteridine O4." Biochemistry. (1996) Jun 4;35(22):7012-20. (*corresponding author)
• Reyes, V.M., Sawaya, M.R., Brown, K.A., Kraut, J. "Isomorphous crystal structures of Escherichia coli dihydrofolate reductase complexed with folate, 5-deazafolate, and 5,10-dideazatetrahydrofolate: mechanistic implications." Biochemistry. (1995) Feb 28;34(8):2710-23.

(d) Synergistic Activities:
• Member, External Faculty, Ph.D. Program of the Golisano Institute of Computing and Information Sciences, R.I.T. (Prof. P.-C. Shi, Dept. of Computer Science, director)
• Member, Center for Applied and Computational Mathematics, Dept. of Mathematics, R.I.T. (Prof. A. Harkin, Dept. of Mathematics, director)
• Rochester Inst. of Techn./Rochester Gen. Hosp. Biomedical Research & Programs Alliance, Sum. 2009
• Encyclopedia of Life/Dictyostelium discoideum proteome project with Prof. W. Loomis, UCSD Dept. of Biology, 2003-2004, and Drs. W. Li, G. Quinn & P. Bourne, SDSC, 2002-2006
• Bio 101 team-teaching project, directed by Prof. R. Pozos, SDSU, under IRACDA program, 2004-2008.

(e) Collaborators & Other Affiliations:

(i) Collaborators:
• Paul Craig (RIT): Proteomics team-teaching, spring 2009 and 2010
• Lea Michel (RIT) & Michael Pichichero (RGH): vaccine development project, summer 2009

(ii) Graduate/Postdoctoral Advisors:
• Ph.D. Dissertation Advisor: Prof. John Abelson, Dept. of Biology, Caltech
• Postdoctoral Research Mentor: Prof. Joseph Kraut, Dept. of Chem. & Biochem., U.C. San Diego
• Postdoctoral Research Sponsors: Drs. F. Wong-Staal/R. Gallo (NCI/NIH); Dr. I. Wilson (TSRI); Drs. L. Brunton (UCSD)/R. Pozos (SDSU)/P. Bourne (SDSC)

(iii) Thesis Advisees & Mentees (Total: 9): (1) Vrunda Sheth (M.S., graduated 2009; Applied Biosystems); (2) Mark McCreary (M.S., RIT, current); (3) Arkanjan Banerjee (M.S., RIT, current); (4) Srujana Reddy Cheguri (M.S., RIT, current); (5) Dong Jin Kim (M.S., RIT, current); (6) Andrew Clark (M.S., RIT, current); (7) Wan Munirah Wan Mohamad (B.S., RIT, current); (8) Madolyn MacDonald (B.S., RIT, current); (9) Muhamad Hanafi Hazemi (B.S., graduated 2010, RIT)

SUMMARY PROPOSAL BUDGET
YEAR 1

FOR NSF USE ONLY: PROPOSAL NO.; DURATION (months): Proposed, Granted; AWARD NO.
ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes
Columns: NSF Funded Person-months (CAL, ACAD, SUMR); Funds Requested By proposer ($); Funds granted by NSF (if different)

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates (List each separately with title, A.7. show number in brackets)
   1. Vicente M Reyes - PI: CAL 0.00, ACAD 0.50, SUMR 0.50; $7,663
   2.-5. (none)
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): CAL 0.00, ACAD 0.00, SUMR 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): CAL 0.00, ACAD 0.50, SUMR 0.50; $7,663
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: CAL 11.76, ACAD 0.00, SUMR 0.00; $39,200
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): CAL 6.00, ACAD 0.00, SUMR 0.00; $15,000
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $78,883
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $18,390
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $97,273
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   Terabyte Storage (3 terabytes of storage space): $3,000
   TOTAL EQUIPMENT: $3,000
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $12,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $5,600
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $0
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $7,711
H. TOTAL DIRECT COSTS (A THROUGH G): $119,984
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE)
   Modified Total Direct Costs (MTDC) (Rate: 44.5000, Base: 116984)
   TOTAL INDIRECT COSTS (F&A): $52,058
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $172,042
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $172,042
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

PI/PD NAME: Vicente M Reyes
ORG. REP. NAME*:
FOR NSF USE ONLY: INDIRECT COST RATE VERIFICATION (Date Checked; Date Of Rate Sheet; Initials - ORG)
fm1030rs-07
*ELECTRONIC SIGNATURES REQUIRED FOR REVISED BUDGET

SUMMARY PROPOSAL BUDGET
YEAR 2

FOR NSF USE ONLY: PROPOSAL NO.; DURATION (months): Proposed, Granted; AWARD NO.
ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes
Columns: NSF Funded Person-months (CAL, ACAD, SUMR); Funds Requested By proposer ($); Funds granted by NSF (if different)

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates (List each separately with title, A.7. show number in brackets)
   1. Vicente M Reyes - PI: CAL 0.00, ACAD 0.50, SUMR 0.50; $7,893
   2.-5. (none)
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): CAL 0.00, ACAD 0.00, SUMR 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): CAL 0.00, ACAD 0.50, SUMR 0.50; $7,893
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: CAL 11.76, ACAD 0.00, SUMR 0.00; $40,376
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): CAL 6.00, ACAD 0.00, SUMR 0.00; $15,450
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $80,739
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $19,512
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $100,251
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   TOTAL EQUIPMENT: $0
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $3,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $0
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $2,000
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $4,111
H. TOTAL DIRECT COSTS (A THROUGH G): $107,362
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE)
   Modified Total Direct Costs (MTDC) (Rate: 44.5000, Base: 107362)
   TOTAL INDIRECT COSTS (F&A): $47,776
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $155,138
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $155,138
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

PI/PD NAME: Vicente M Reyes
ORG. REP. NAME*:
FOR NSF USE ONLY: INDIRECT COST RATE VERIFICATION (Date Checked; Date Of Rate Sheet; Initials - ORG)
fm1030rs-07
*ELECTRONIC SIGNATURES REQUIRED FOR REVISED BUDGET

SUMMARY PROPOSAL BUDGET
YEAR 3

FOR NSF USE ONLY: PROPOSAL NO.; DURATION (months): Proposed, Granted; AWARD NO.
ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes
Columns: NSF Funded Person-months (CAL, ACAD, SUMR); Funds Requested By proposer ($); Funds granted by NSF (if different)

A. SENIOR PERSONNEL: PI/PD, Co-PI's, Faculty and Other Senior Associates (List each separately with title, A.7. show number in brackets)
   1. Vicente M Reyes - PI: CAL 0.00, ACAD 0.50, SUMR 0.50; $8,130
   2.-5. (none)
   6. ( 0 ) OTHERS (LIST INDIVIDUALLY ON BUDGET JUSTIFICATION PAGE): CAL 0.00, ACAD 0.00, SUMR 0.00; $0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6): CAL 0.00, ACAD 0.50, SUMR 0.50; $8,130
B. OTHER PERSONNEL (SHOW NUMBERS IN BRACKETS)
   1. ( 1 ) POST DOCTORAL SCHOLARS: CAL 11.76, ACAD 0.00, SUMR 0.00; $41,587
   2. ( 1 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.): CAL 6.00, ACAD 0.00, SUMR 0.00; $15,914
   3. ( 4 ) GRADUATE STUDENTS: $11,520
   4. ( 2 ) UNDERGRADUATE STUDENTS: $5,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY): $0
   6. ( 0 ) OTHER: $0
   TOTAL SALARIES AND WAGES (A + B): $82,651
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS): $20,686
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C): $103,337
D. EQUIPMENT (LIST ITEM AND DOLLAR AMOUNT FOR EACH ITEM EXCEEDING $5,000.)
   TOTAL EQUIPMENT: $0
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS): $10,000
   2. FOREIGN: $0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS: $0
   2. TRAVEL: $0
   3. SUBSISTENCE: $0
   4. OTHER: $0
   TOTAL NUMBER OF PARTICIPANTS ( 0 ); TOTAL PARTICIPANT COSTS: $0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES: $0
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION: $2,000
   3. CONSULTANT SERVICES: $0
   4. COMPUTER SERVICES: $2,111
   5. SUBAWARDS: $0
   6. OTHER: $0
   TOTAL OTHER DIRECT COSTS: $4,111
H. TOTAL DIRECT COSTS (A THROUGH G): $117,448
I. INDIRECT COSTS (F&A)(SPECIFY RATE AND BASE)
   Modified Total Direct Costs (MTDC) (Rate: 44.5000, Base: 117448)
   TOTAL INDIRECT COSTS (F&A): $52,264
J. TOTAL DIRECT AND INDIRECT COSTS (H + I): $169,712
K. RESIDUAL FUNDS: $0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K): $169,712
M. COST SHARING PROPOSED LEVEL: $0; AGREED LEVEL IF DIFFERENT: $

PI/PD NAME: Vicente M Reyes
ORG. REP. NAME*:
FOR NSF USE ONLY: INDIRECT COST RATE VERIFICATION (Date Checked; Date Of Rate Sheet; Initials - ORG)
fm1030rs-07
*ELECTRONIC SIGNATURES REQUIRED FOR REVISED BUDGET

SUMMARY PROPOSAL BUDGET: CUMULATIVE

ORGANIZATION: Rochester Institute of Tech
PRINCIPAL INVESTIGATOR / PROJECT DIRECTOR: Vicente M Reyes

NSF-funded person-months (CAL, ACAD, SUMR) and funds requested by proposer:

                                                                   CAL   ACAD   SUMR Funds ($)
A. SENIOR PERSONNEL
   1. Vicente M Reyes - PI                                        0.00   1.50   1.50    23,686
   6. (   ) OTHERS                                                0.00   0.00   0.00         0
   7. ( 1 ) TOTAL SENIOR PERSONNEL (1 - 6)                        0.00   1.50   1.50    23,686
B. OTHER PERSONNEL
   1. ( 3 ) POST DOCTORAL SCHOLARS                               35.28   0.00   0.00   121,163
   2. ( 3 ) OTHER PROFESSIONALS (TECHNICIAN, PROGRAMMER, ETC.)   18.00   0.00   0.00    46,364
   3. ( 12 ) GRADUATE STUDENTS                                                          34,560
   4. ( 6 ) UNDERGRADUATE STUDENTS                                                      16,500
   5. ( 0 ) SECRETARIAL - CLERICAL (IF CHARGED DIRECTLY)                                     0
   6. ( 0 ) OTHER                                                                            0
   TOTAL SALARIES AND WAGES (A + B)                                                    242,273
C. FRINGE BENEFITS (IF CHARGED AS DIRECT COSTS)                                         58,588
   TOTAL SALARIES, WAGES AND FRINGE BENEFITS (A + B + C)                               300,861
D. TOTAL EQUIPMENT                                                                       3,000
E. TRAVEL
   1. DOMESTIC (INCL. CANADA, MEXICO AND U.S. POSSESSIONS)                              25,000
   2. FOREIGN                                                                                0
F. PARTICIPANT SUPPORT COSTS
   1. STIPENDS                                                                               0
   2. TRAVEL                                                                                 0
   3. SUBSISTENCE                                                                            0
   4. OTHER                                                                                  0
   TOTAL PARTICIPANT COSTS (PARTICIPANTS: 0)                                                 0
G. OTHER DIRECT COSTS
   1. MATERIALS AND SUPPLIES                                                             5,600
   2. PUBLICATION COSTS/DOCUMENTATION/DISSEMINATION                                      4,000
   3. CONSULTANT SERVICES                                                                    0
   4. COMPUTER SERVICES                                                                  6,333
   5. SUBAWARDS                                                                              0
   6. OTHER                                                                                  0
   TOTAL OTHER DIRECT COSTS                                                             15,933
H. TOTAL DIRECT COSTS (A THROUGH G)                                                    344,794
I. INDIRECT COSTS (F&A)
   TOTAL INDIRECT COSTS (F&A)                                                          152,098
J. TOTAL DIRECT AND INDIRECT COSTS (H + I)                                             496,892
K. RESIDUAL FUNDS                                                                            0
L. AMOUNT OF THIS REQUEST (J) OR (J MINUS K)                                           496,892
M. COST SHARING PROPOSED LEVEL                                                               0

PI/PD NAME: Vicente M Reyes

Budget Justification

This project will begin on September 1, 2011, and end on August 31, 2014. All of the work for this project will take place at RIT.

Senior Personnel: Salaries and Wages

Dr. Vicente Reyes (PI), Assistant Professor of Biological Sciences, will devote 0.5 academic month and 0.5 summer month of effort to the project per year. We request NSF funds for the salary associated with this effort. Dr. Reyes will devote additional effort as necessary, funded by RIT, to accomplish the goals of the project. Dr. Reyes will supervise the graduate and undergraduate students, as well as the post-doc and the technician, in his group. He will write any Fortran programs needed to implement the required algorithms for his projects and have the students and post-doc test them on various datasets. He will also write the outlines for manuscripts and reviews for publication, and have his students and post-doc write the details as part of their training.

Dr. Reyes works at RIT on a 9.5-month academic contract. The salary request for his effort in Year 1 is based upon his projected 2010-2011 annual base salary. Summer salary at RIT is calculated as the prior academic year’s annual salary, multiplied by 26.3%, multiplied by the proportion of the summer that the faculty member is working on the project. For this proposal, that proportion is 0.5/2.5 months. The salary request for Dr. Reyes has been incremented by 3% per year to account for inflation.

Other Personnel: Salaries and Wages

For each year of the project period, we request salary funds for a part-time technician/computer programmer (to be hired) who will devote 6 calendar months of effort per year to the project (starting salary: $15,000/year). The technician/programmer will be responsible for writing various scripts (UNIX C shell, Perl, etc.), creating web applications and web pages for the group, and maintaining our various databases, among other tasks.

In addition, we request funds for a full-time postdoctoral research associate (to be hired), with a starting salary of $40,000 per year. This postdoctoral research associate will devote 11.76 calendar months of effort to the project per year and will be responsible for starting novel research projects, writing pertinent reviews and papers for his/her own career development, and helping the undergraduate and graduate students in the group as needed. The salaries for the part-time technician/computer programmer and for the postdoctoral research associate have been incremented by 3% per year to account for inflation.

Furthermore, we request funds for 4 part-time bioinformatics master’s students to each work 12 hours per week, at $12/hour, for 20 weeks per year ($11,520 per year). These bioinformatics students will have their own individual thesis projects, as assigned and outlined by the PI. Due to the students’ part-time status at RIT and RIT’s regulations, these students will be hired as temporary employees for this role. Different students will be paid for their work on the project each year.

We request funds for an undergraduate student to work during each academic year, for 10 hours/week, for 30 weeks, at $11/hour ($3,300 per year). This student will carry out a project in the area of computational structural biology, the main area of the PI's research program. We also request funds for an undergraduate student to work during each summer, for 20 hours/week, for 10 weeks, at $11/hour ($2,200 per year). This student will also work on a project in the area of computational structural biology.
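The salary rules described above can be checked with a short calculation. The following is a minimal sketch, not RIT's actual budgeting procedure: the function names are illustrative, rounding half up to whole dollars is an assumption, and the $70,000 base salary passed to `summer_salary` is purely hypothetical (the PI's base salary is not stated). The $15,000 technician figure is treated as the Year 1 request for the 6 months of effort.

```python
from decimal import Decimal, ROUND_HALF_UP

ESCALATION = Decimal("1.03")  # 3% annual increment stated in the justification


def to_dollars(amount: Decimal) -> int:
    """Round to whole dollars, half up (the rounding rule is an assumption)."""
    return int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def escalated_request(start_amount: str, effort_fraction: str, year: int) -> int:
    """Amount requested in project year `year` (1-based) after 3% annual increases."""
    base = Decimal(start_amount) * ESCALATION ** (year - 1)
    return to_dollars(base * Decimal(effort_fraction))


def summer_salary(prior_year_salary: str, months_on_project: str) -> int:
    """Prior academic-year salary x 26.3% x (months on project / 2.5 summer months)."""
    share = Decimal(months_on_project) / Decimal("2.5")
    return to_dollars(Decimal(prior_year_salary) * Decimal("0.263") * share)


# Postdoc: $40,000 starting salary at 11.76 of 12 calendar months (0.98 effort).
postdoc = [escalated_request("40000", "0.98", year) for year in (1, 2, 3)]

# Technician/programmer: $15,000 requested in Year 1, escalated 3% per year.
technician = [escalated_request("15000", "1", year) for year in (1, 2, 3)]

# Four master's students, 12 hours/week for 20 weeks at $12/hour.
grad_students = 4 * 12 * 20 * 12

print("postdoc:", postdoc)
print("technician:", technician)
print("graduate students per year:", grad_students)
print("summer salary (hypothetical $70,000 base):", summer_salary("70000", "0.5"))
```

Under these assumptions the Year 2 and Year 3 results match the postdoctoral and other-professional lines of the budget forms (40,376 and 41,587; 15,450 and 15,914), and the graduate student figure matches the $11,520 per year stated above.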


Fringe Benefits

We request fringe benefits for Dr. Reyes for each academic year, at 29.6% for Y1, 30.6% for Y2, and 31.6% for Y3 (reflecting an escalation of 100 basis points per year), and at 7.9% per year for each summer. We request fringe benefits at the same escalating rates (29.6% for Y1, 30.6% for Y2, and 31.6% for Y3) for the part-time technician/computer programmer and for the postdoctoral research associate. In addition, we request fringe benefits for the temporary employees at 7.9% per year. All fringe benefits are based upon RIT’s federally negotiated benefit rates for work on federally sponsored projects (DHHS rate agreement, effective 07/01/09-06/30/10 and provisional until new rates are negotiated).

Capital Equipment

We request $3,000 in Year 1 for three terabytes of storage space to hold datasets, generated results, and some necessary software.

Travel

We request $4,000 for the PI to travel to Urbana-Champaign, IL, to attend the Advanced Mathematics Workshop in Year 1. This will cover tuition, airfare, lodging, and meals. At this workshop, the PI will learn about specialized abstract mathematical tools available only in Mathematica (including NKS, etc.), which will be applied to the PI's current and future projects in the field of computational structural biology. A good example is the application of non-Euclidean geometries to the representation and analysis of macromolecular structures.

In addition, we request $2,000 for the PI to attend the Gordon conference in Year 3 of the project period (including registration, transportation, lodging, and meals), in order to present the research results from his group and get feedback from his peers, as well as to share his knowledge and expertise by providing feedback on the results of other research groups.

We also request $1,500 in domestic travel funds per year for the PI to travel to a professional meeting, such as the computational biology and bioinformatics conferences sponsored by the International Society for Computational Biology (ISCB). This will include registration, transportation, lodging, and meals.

Furthermore, we request funds for the PI and 4 master’s students to attend the 17th SUNY Albany Conversation Conference during Year 1 and Year 3 of the project period ($1,000/person/trip x 5 travelers per year = $5,000 per year for Year 1 and Year 3). This will include registration, transportation, lodging, and meals.

We request $1,500 per year in domestic travel funds for the postdoctoral research associate to attend one professional meeting per year (e.g., computational biology and bioinformatics conferences sponsored by the ISCB).

Other Direct Costs

We request $1,000 for software (such as the most current versions of Mathematica and MATLAB) in Year 1. We also request $3,000 in Year 1 for RAM and CPU upgrades for computers used for the project. In addition, we request two graphics monitors (one for the PI’s office and one for his master’s students working on the project) at $800 each (total: $1,600) in Year 1. The graphics monitors will allow us to use our visualization tools for macromolecular 3D structures efficiently and with reasonable speed.


Furthermore, we request $2,000 per year in Years 2 and 3 for journal page charges, so that the PI can publish and disseminate the project results. We request Information Technology and Service (ITS) charges at RIT’s rate on federally sponsored projects of $88.70/FTE/month. This covers services such as the maintenance of email accounts. ITS charges do not apply to faculty summer salary.

Indirect (Facilities and Administrative) Costs

RIT’s federally negotiated indirect cost rate (U.S. Department of Health and Human Services, effective 07/01/09 and forward until a new rate is negotiated) is 44.5% of modified total direct costs (total direct costs minus equipment over $1,500, participant support costs, and subawards in excess of $25,000). A copy of RIT’s indirect cost rate agreement is available upon request.

Total Request to NSF: $496,892
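The indirect cost calculation can be reproduced from the budget forms. The sketch below is illustrative only: the function names are hypothetical, rounding half up to whole dollars is an assumption, and the total direct cost figures are the Year 2 and Year 3 values on the forms, neither of which includes equipment, participant support, or subawards.

```python
from decimal import Decimal, ROUND_HALF_UP

F_AND_A_RATE = Decimal("0.445")  # 44.5% of MTDC, as stated in the justification


def mtdc_base(total_direct: int, equipment: int = 0,
              participant_support: int = 0, subaward_excess: int = 0) -> int:
    """Modified total direct costs: total direct costs minus the excluded categories."""
    return total_direct - equipment - participant_support - subaward_excess


def indirect_costs(base: int) -> int:
    """F&A on the MTDC base, rounded half up to whole dollars (rounding is assumed)."""
    amount = Decimal(base) * F_AND_A_RATE
    return int(amount.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Total direct costs (line H) for Years 2 and 3, taken from the budget forms.
total_direct = {2: 107_362, 3: 117_448}

year_totals = {}
for year, direct in total_direct.items():
    base = mtdc_base(direct)
    f_and_a = indirect_costs(base)
    year_totals[year] = (base, f_and_a, direct + f_and_a)
    print(f"Year {year}: base {base:,}  F&A {f_and_a:,}  total {direct + f_and_a:,}")
```

With these inputs the sketch yields 47,776 and 52,264 in indirect costs and 155,138 and 169,712 in total costs, which agree with lines I and J of the Year 2 and Year 3 forms.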


Current and Pending Support (See GPG Section II.C.2.h for guidance on information to include on this form.) The following information should be provided for each investigator and other senior personnel. Failure to provide this information may delay consideration of this proposal.

Other agencies (including NSF) to which this proposal has been/will be submitted.

Investigator: Vicente Reyes

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: F.E.A.D. Summer Salary Award
Source of Support: College of Science, RIT
Total Award Amount: $3,000
Total Award Period Covered: 07/01/10 - 08/31/10
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00, Acad: 0.00, Sumr: 0.50

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: ABI Innovation: Use of Spherical and Cylindrical Coordinate Systems to Represent Protein 3D Structures: Applications to Epitope Mapping and Ligand Binding Site Prediction
Source of Support: NSF
Total Award Amount: $496,893
Total Award Period Covered: 09/01/11 - 08/31/14
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00, Acad: 0.50, Sumr: 0.50

Support: Current / Pending / Submission Planned in Near Future / *Transfer of Support
Project/Proposal Title: ABI Innovation: Use of a Reduced Protein Representation for the Modeling and Screening of Ligand Binding Sites: A Structure-Based Protein Function Prediction Method
Source of Support: NSF
Total Award Amount: $496,893
Total Award Period Covered: 09/01/11 - 08/31/14
Location of Project: RIT
Person-Months Per Year Committed to the Project: Cal: 0.00, Acad: 0.50, Sumr: 0.50

*If this project has previously been funded by another agency, please list and furnish information for immediately preceding funding period.


Facilities, Equipment and Other Resources. Faculty of the Bioinformatics Program at RIT have exclusive access to a variety of computational resources. These include a number of dedicated computing systems, among them several Sun Enterprise 450 servers: one has 4GB of RAM and 432GB of hard drive storage space, and two others have 4GB of RAM and approximately 1TB of internal storage. Additionally, we have two Dell PowerEdge 1850 servers with two dual-core Xeon processors, 4GB of RAM and 600GB of internal RAID 0 storage; one Dell PowerEdge 1950 with two dual-core Xeon processors, 4GB of RAM and 300GB of storage; and one SunFire 2100 dual-core Opteron machine with 2GB of RAM and 160GB of internal storage, which is a dedicated BLAST and e-mail server. That network is attached to an additional 9.1TB of storage (NAS) for database housing and maintenance, with an additional 3.6TB of storage located in a separate facility. We also have one machine dedicated to monitoring network and server health and one serving as a bridge firewall. RIT has multiple means of external network connectivity, including 400Mb/s and 200Mb/s Internet2 connections routed via Gigabit Ethernet and a 45Mb/s T3 connection for backup. RIT owns and operates its own Dense Wavelength Division Multiplexing (DWDM) network for connecting the Gigabit Ethernet connections to our ISPs. The DWDM network (as configured) has the carrying capability of 32 10Gb/s channels (lambdas) to points of interest to the RIT community. Wireless connectivity is ubiquitous throughout campus. RIT also has access to the program SAS through a site license. The PI also has access to various software and approximately 10TB of data storage at the department level (http://www.adobo.bioinformatics.rit.edu), and to further computing resources and storage at RIT's I.T. Collaboratory (http://www.rit.edu/research/itc/) and Research Computing (http://rc.rit.edu/).
RIT’s office of Sponsored Programs Accounting will handle project accounting and budgetary procedures. The budget will be monitored via RIT’s Oracle System, which tracks all expenses. RIT’s office of Sponsored Research Services will manage grant reporting.


Postdoctoral Mentoring Plan: In order to ensure that the "postdoctoral product" of the PI is of high quality and fully capable of becoming an integral part of the next generation of independent and multidisciplinary scientific researchers (be it in academia, government, or industry), the following steps will be taken:

• he/she will be encouraged to sit in on or audit courses at RIT related to computational biology or bioinformatics in which he/she has little or no background;
• he/she will be required to disseminate his/her scientific results in peer-reviewed scientific journals and conferences;
• he/she will be asked to actively participate in the writing/preparation of grant proposals by the PI;
• he/she will be encouraged to take advantage of career counseling services offered by RIT;
• he/she will be asked occasionally to advise/guide research projects of graduate and undergraduate students in the group;
• he/she will be strongly encouraged to attend courses and/or workshops in scientific and professional ethics;
• he/she will be trained in the scientific method and way of thinking through extensive practice in posing the right questions when solving an open, broad scientific problem.

The office of Sponsored Research Services at RIT conducts extensive organized training for principal investigators and other personnel involved in externally sponsored projects. In the 2009-2010 academic year, sponsored research staff organized over fifty hours of training addressing topics including peer review, funding agency overviews, compliance, intellectual property, budgeting and others. These sessions were attended by 250 individuals. Post-docs are strongly encouraged to attend these training events. Additionally, Sponsored Research Services and the office of Teaching and Learning Services at RIT sponsor and run an annual Grant Writers’ Boot Camp, an intensive two-day session in the fundamentals of grant writing and peer review. Participants, including post-docs, come prepared with proposals for internal seed funding awards that are reviewed and revised over the course of the program. The Principal Investigator will work with the post-doc on an individualized professional development plan to take advantage of these and other services at RIT.