Biology

Dissertation topic

Development of algorithms for peptide identification from mass spectrometric data in genomic databases

Inaugural dissertation for the attainment of the doctoral degree in natural sciences in the Department of Biology of the Faculty of Mathematics and Natural Sciences of the Westfälische Wilhelms-Universität Münster

submitted by Jens Allmer from Warendorf, 2006

Dean: Prof. Dr. N. Sachser

First reviewer: Prof. Dr. M. Hippler

Second reviewer: Prof. Dr. E. Bornberg-Bauer

Date of the oral examination:
Date of conferral of the doctorate:

To my wife Açalya and my parents.

Index

I INTRODUCTION
  1 Proteomics
  2 Gene Prediction
  3 Mass Spectrometry
    3.1 Fractionation
    3.2 Mass Spectrometer
      3.2.1 Ionization
      3.2.2 Ion Transfer and Focusing
      3.2.3 Ion-Trap
      3.2.4 Ion Detection
    3.3 Mass Spectrometry
      3.3.1 MS/MS
  4 Database Searching
    4.1 Mascot
    4.2 Sequest
  5 De Novo Sequencing
    5.1 PEAKS
  6 Limitations
    6.1 Quality of MS/MS Spectra
    6.2 Prediction and Identification Accuracy
    6.3 Sequence Availability in Databases
  7 Specific Aims

II MATERIALS AND METHODS
  1 Experimental Design
  2 Experimental Procedures
    2.1 1D SDS PAGE
    2.2 2D SDS PAGE
  3 Computational Processing
    3.1 PEAKS
    3.2 Sequest
    3.3 Query Creation
    3.4 The GenomicPeptideFinder
    3.5 AutoMS
    3.6 Database
  4 Assessment of De Novo Sequencing Algorithms
  5 GenomicPeptideFinder Implementations
  6 Confidence Calculation
  7 Protein Identification

III PUBLICATIONS
  1 Overview over the Included Publications
    1.1 Publication 1
    1.2 Publication 2
    1.3 Publication 3
    1.4 Publication 4

IV DISCUSSION
  1 Gene Structure
    1.1 Gene Annotation
    1.2 Alternative Splicing
    1.3 Visualization
  2 Protein Identification
    2.1 Confident Protein Identification
    2.2 Protein Identification Using Complementing Algorithms
    2.3 One-Hit Wonders
  3 Bridging the Gap
  4 Outlook
    4.1 Quality of MS/MS Spectra and Mass Spectrometer Resolution
    4.2 De Novo Sequencing Accuracy
    4.3 GenomicPeptideFinder
    4.4 Visualization

V SUMMARY

VI REFERENCES

VII APPENDICES
  1 Appendix A
  2 Appendix B
  3 Appendix C
    3.1 Websites
    3.2 Source Codes
    3.3 Manuals
    3.4 Papers
  4 Abbreviations

VIII CURRICULUM VITAE

IX ACKNOWLEDGEMENTS

I Introduction

The main problem addressed in this dissertation is the identification, from mass spectrometric data, of peptides whose coding sequence is split by an intron in genomic DNA. This problem touches many fields, which I will either merely outline or treat in depth, depending on their importance for the problem and its solution. The introduction is therefore organized in several sections, each of which addresses one of these fields. First, the biological context is introduced in Section I1. Following this, a currently widespread method for dealing with gene structure is described in Section I2. Thereafter, the data acquisition method is discussed in Section I3. Sections I4 and I5 deal with means to process and evaluate the gathered data, which directly influence or are affected by the problem stated in Section I6. Finally, Section I7 presents my proposed solution to that problem.

1 Proteomics

The sequencing of many genomes has reached completion (Bonizzoni et al., 2005). Furthermore, the mouse, human, and other genomes have been covered multiple times by shotgun sequencing. The same is true for the genome of the eukaryotic, unicellular green alga Chlamydomonas reinhardtii, which has nine-fold coverage and is now available in its third assembly (http://www.jgi.doe.gov/). However, having a completely sequenced genome is not enough to elucidate biological function (Pandey and Mann, 2000). The genome may be regarded as the static blueprint of an organism. In this case, the proteome can be regarded as more than a full complement to a genome. Furthermore, the proteome cannot be considered static. Proteins may, for example, be differentially expressed depending on environmental and developmental conditions. They may be post-translationally modified by glycosylation, phosphorylation, and proteolytic cleavage, to mention only a few possibilities (Mann and Jensen, 2003; Peng et al., 2003; Searle et al., 2005; Shadforth et al., 2005b; Taylor et al., 2003). Proteins may also be unevenly distributed in space and time (Baginsky and Gruissem, 2004; Greenbaum et al., 2003; Taylor et al., 2003; White et al., 2004); thus their interactions with other proteins may vary under these circumstances, which may in turn influence their function (Bock and Gough, 2003; Peng et al., 2003). All these different species of proteins make up the proteome of an organism.

In the following, the term apparent proteome will be used to define the entirety of proteins that can be found in a specific sample. The broader term proteome will be used to describe the entirety of protein species that are encoded by a genome, including all alternatively spliced isoforms and all their potential modifications. Functional aspects of proteins are left out of this definition, since they belong to a different level of abstraction. The proteome must thus be viewed as a complex mixture of modified and unmodified proteins. Compared to this, the genome represents a less complex entity. The proteome varies not only between organisms, but also from cell to cell within one organism. In addition, the temporal and spatial distributions of the proteins have to be taken into account.

In analogy to genomics, the term proteomics was coined to describe the study of the proteome and related aspects. An early definition of proteomics states that it is a technology with which the entire protein complement of a sample can be described (Wilkins et al., 1996). Other, more recent definitions include functional and interactional aspects (McLaughlin et al., 2006; Pandey and Mann, 2000). Proteomics will here be defined as the entirety of available methods to study the proteome, including all computational techniques to process and evaluate data generated by these methods, temporal and spatial expression patterns, protein-protein interactions and interactions with other molecules, as well as protein functions. The present study will also show that proteomics and genomics cannot be separated, but must be advanced in parallel. Thus, proteome data must be mapped back to the genome, and progress in the understanding of the genome can directly aid proteomic studies (Mann et al., 2001), if only through enhanced gene annotation.
Because a huge number of methods are available for proteomics, not all of them can be examined here. Historically, two dimensional (2D) polyacrylamide gel electrophoresis (PAGE) was a powerful technique for proteomics (Fey and Larsen, 2001). New advances in liquid chromatography (LC) increasingly challenge this method (Jensen et al., 1999; Link et al., 1999; Peng et al., 2003). However, neither 2D PAGE nor LC or multidimensional LC can directly identify proteins. Edman degradation used to be the tool of choice for protein sequencing and thus protein identification, but it is rapidly being displaced by mass spectrometry (Pandey and Mann, 2000). Now, due to speed, automation, and cost advantages, proteins are usually identified by mass spectrometry (MS) and computational evaluation of the resulting data (Bafna and Reinert, 2004; Mann and Jensen, 2003; Mann and Pandey, 2001; Shadforth et al., 2005b).

In the present study the research organism was Chlamydomonas reinhardtii, which is an emerging model system for proteomic research (Stauber and Hippler, 2004) and also allows the investigation of protein-protein interactions (Sommer, 2003). One important aspect that particularly qualifies C. reinhardtii in the context of this study is that its genes display complex intron-exon structures (see below). Enriched thylakoid membranes were isolated from crude cell extracts of C. reinhardtii grown under both iron-sufficient and iron-deficient conditions (Moseley et al., 2002). This extract was separated via sucrose density centrifugation. Then both one dimensional (1D) and 2D PAGE were performed with the thylakoid fractions of the sucrose gradient. The spots and bands were excised and digested in-gel with trypsin, and the resulting peptides were submitted to mass spectrometry via liquid chromatography, for which a reverse phase (RP) column was used. The resulting spectra were analyzed by a set of computational tools. Sequest, a database search software, was used to correlate mass spectra to sequence databases (Eng et al., 1994). The PEAKS algorithm was used to perform de novo amino acid sequencing of the mass spectra (Ma et al., 2003). The GenomicPeptideFinder (GPF), an algorithm devised within the present study, was used to match the de novo predictions with the genomic database of C. reinhardtii (Allmer et al., 2004).

With the advent of genomics and the rising availability of genetic sequences, the need for defining gene structure became evident. For higher eukaryotes with complex gene structure, where exons are frequently interrupted by introns, computational tools emerged that promised to predict the correct gene structure. Since this study focuses especially on the identification of peptides that are split by introns when deduced from genomic DNA, the vast field of gene prediction will be outlined in the following section.
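To illustrate why a peptide that spans an intron is hard to find in genomic DNA, the following minimal sketch translates a DNA sequence in all six reading frames and searches for a peptide. It is not the GenomicPeptideFinder algorithm; the function names, the toy exon and intron sequences, and the peptide are hypothetical and chosen only for illustration. The standard genetic code is assumed.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_hits(dna, peptide):
    """Return (strand, frame, position) for each frame containing the peptide."""
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            position = translate(seq[frame:]).find(peptide)
            if position >= 0:
                hits.append((strand, frame, position))
    return hits


# Hypothetical toy gene: two exons separated by a GT...AG intron.
EXON_1 = "ATGAAAACCGCT"          # encodes MKTA
INTRON = "GTAAGTTTTTTTTCTAACAG"
EXON_2 = "TATATTGCGAAA"          # encodes YIAK
PEPTIDE = "MKTAYIAK"

genomic_hits = six_frame_hits(EXON_1 + INTRON + EXON_2, PEPTIDE)
spliced_hits = six_frame_hits(EXON_1 + EXON_2, PEPTIDE)
```

In this toy case the peptide is found only in the spliced sequence, because the intron separates the two halves of its coding sequence in the genomic DNA.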

2 Gene Prediction

Today, genomic data is still more readily available than proteomic data. One research focus is therefore to assign gene structure to genomic data by computational analysis alone. One such area is gene prediction, which tries to predict intron-exon structure in genomic sequences. Publicly available genomic databases, such as those provided by the Joint Genome Institute (JGI, http://www.jgi.doe.gov/), accompany their sequence information with predicted gene models. Since gene prediction started with the availability of genomic sequences, numerous gene prediction tools have been developed. Gene prediction that uses only information contained in the nucleotide sequence to assign gene structure is called ab initio gene prediction. Some tools that perform ab initio gene prediction are GENEID (Guigo et al., 1992), GENIE (Kulp et al., 1996), GENSCAN (Burge and Karlin, 1997), HMMGENE (Krogh, 1997), FGENESH (Salamov and Solovyev, 2000), GeneSplicer (Pertea et al., 2001), and EasyGene (Larsen and Krogh, 2003), to name only a small fraction of the available tools. These software applications differ mostly in the algorithm employed for gene finding, which is usually a learning algorithm based on neural networks, hidden Markov models, or related approaches. New developments promise to significantly increase the precision of gene prediction. These tools combine gene prediction with sequence alignment of related species. Tools that use two genomes for such an analysis are, for example, SGP-2 (Parra et al., 2003), SLAM (Alexandersson et al., 2003), and TWINSCAN (Flicek et al., 2003; Korf et al., 2001). Naturally, it would be of interest to include more than two genomes in these analyses. Such approaches are being developed at the moment, but are not yet usable in practice (Brent and Guigo, 2004). Despite all these efforts, only 15-20% of known gene structures are predicted correctly (Flicek et al., 2003; Parra et al., 2003).
In line with this, recent assessments of the accuracy of gene predictions also revealed that their precision is not very high (Reboul et al., 2003). At present, gene prediction faces a number of limitations. One of them, which strongly underlines the need for experimental evidence, is that gene predictors predict only one splice form per gene (Brent and Guigo, 2004). In contrast, it is widely recognized that in organisms with complex genomes alternative splicing plays an important role in protein diversity. This is underlined by the estimate that 40-60% of human genes are alternatively spliced (Modrek and Lee, 2002). Therefore, the assessment of alternative splicing is attracting growing interest (Boue et al., 2002). Tools that try to assess alternative splicing by use of EST data usually depend on Blast or Blat and typically make only single EST alignments (Churbanov et al., 2005). Recently a heuristic algorithm called ASPIC has been proposed that overcomes these limitations by making multiple EST alignments and multiple genome alignments at the same time, while employing a new method to align and score the combined information which, unlike Blast or Blat, was specifically designed for this purpose (Bonizzoni et al., 2005). Another new strategy is to predict exon skipping and intron retention on the basis of Pfam domains (Bateman et al., 2004), which also removes the dependence on EST data (Hiller et al., 2005). Nonetheless, gene predictors, especially those that employ multiple genomes, are now at a development stage where their predictions can be used as hypotheses to drive experimental annotation (Brent and Guigo, 2004; Eyras et al., 2005). Often, available proteomic information could be used to enhance gene prediction. A tool that weighs all available information with a bias towards experimental data has, however, yet to be presented. As mentioned before, proteomics, and especially protein identification, relies heavily on mass spectrometry, which will be introduced in the next section.

3 Mass Spectrometry

Mass spectrometry of amino acid sequences is in extensive use today and is the method of choice when trying to identify peptides and proteins from complex mixtures (Aebersold and Mann, 2003; Baczek et al., 2004; Havilio et al., 2003). The complexity of the mixture and the size of the proteins are limiting factors in mass spectrometry and the subsequent identification of proteins by computational tools. Therefore, it is desirable to reduce the complexity of the mixture in order to facilitate post-spectrometric data analysis which will be considered in Sections I4 and I5. In order to identify proteins, they are first proteolytically cleaved into peptides, usually using trypsin. This adds even more complexity to the mixture (Veenstra et al., 2004), but allows for the use of methods which would be inherently more complicated using intact proteins (Nesvizhskii et al., 2003). The advantage is that complete or partial amino acid sequences can be determined for the peptides by peptide fragmentation and subsequent mass spectrometry, also called tandem MS, MS/MS, or MS2 (see Section I3.3.1). The information gained using this approach enables sophisticated database matching of the peptides (Section I4) and provides the basis for de novo sequencing attempts (Section I5). It is desirable to reduce the complexity of the sample to as few initial proteins as possible. After extraction of the apparent proteome or parts of it from a sample, it should then be fractionated as distinctively as possible before submission to mass spectrometry (Figure 1). Fractionation, including liquid chromatography (Section I3.1), and mass spectrometry (Sections I3.2 and I3.3) will be explained in the following.
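The proteolytic cleavage step can be sketched in silico. The example below assumes the commonly used rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) unless the next residue is proline (P); the function name and the example sequence are hypothetical, and missed cleavages are not modeled.

```python
def tryptic_digest(protein):
    """Split a protein sequence into fully cleaved tryptic peptides.

    Assumed rule: cleave after K or R, but not when the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Remaining C-terminal peptide that does not end in K or R.
        peptides.append(protein[start:])
    return peptides


# Hypothetical sequence; the R followed by P is not cleaved.
peptides = tryptic_digest("MKTAYIAKQRPLVKGG")
```

A single protein thus yields several peptides, which is why digestion increases the complexity of the mixture while making each analyte easier to sequence.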

[Figure 1 (flow diagram): Extraction -> Fractionation -> Mass spectrometry -> Protein identification]

Figure 1: Abstract processes involved in an experiment that aims to identify proteins from complex mixtures via mass spectrometry and subsequent computational analysis of the mass spectra.

3.1 Fractionation

As outlined above, it is desirable to have a mixture of low complexity before mass spectrometric analysis. Therefore, more or less extensive fractionation is performed prior to mass spectrometry. First, coarser fractionation methods may be employed to separate organelles, membranes, and soluble fractions. This can, for example, be performed using sucrose density centrifugation. The less complex fractions from the first separation may be fractionated further, for example by filtration methods. Subsequently, more discriminating methods of separation may be used. Gel electrophoresis is an established method to separate proteins (Anderson and Anderson, 1977; Laemmli, 1970). Both one dimensional and two dimensional gel electrophoresis are used today in combination with mass spectrometry. One dimensional sodium dodecyl sulfate (SDS) polyacrylamide gel electrophoresis (PAGE) separates proteins according to their molecular weight. This represents the second dimension in 2D SDS PAGE, the first being isoelectric focusing, which separates the proteins by their isoelectric point. The reproducibility of gels among researchers and laboratories is, however, an issue (Fey and Larsen, 2001; White et al., 2004). Furthermore, gel electrophoresis is a quite laborious method that is not easily incorporated into high-throughput proteomics pipelines. It involves many additional manual processing steps before mass spectrometry, such as cutting bands or spots of interest from the gel, digestion, and purification. This calls for new separation methods that are easier to handle. Liquid chromatography (LC) is becoming more important in mass spectrometry today. A number of techniques have been established for protein separation using liquid chromatography. The two complementary methods, strong cation exchange (SCX) LC and reverse phase LC, have been used extensively (Chepanoske et al., 2005; Heller et al., 2005).
A recent development enables the separation of proteins by isoelectric point in a capillary (Jensen et al., 1999). Other separation methods available in LC, such as size exclusion and hydrophilic interaction, enable multidimensional LC (Link et al., 1999; Peng et al., 2003). So far, protein separation has been described. Today, in mass spectrometry, and especially in MS/MS experiments, peptides rather than proteins are analyzed. In electro-spray ionization mass spectrometers LC has to precede ionization (Section I3.3), but matrix assisted laser desorption ionization may also be preceded by LC (Kleno et al., 2004; Pan et al., 2005). Whichever method is used, the fractions generated are fed into the inlet of the mass spectrometer, which is explained further in the following section.

3.2 Mass Spectrometer

A mass spectrometer separates and detects ions by their mass over charge ratio (m/z). This can be achieved by accelerating the ions in an electric field and then visualizing their ballistic trajectories on a fluorescent screen or a photographic plate. This very basic method was established by Sir Joseph John Thomson, the discoverer of the electron, who received the 1906 Nobel Prize in Physics for his investigations of the conduction of electricity by gases. This work ultimately led to the development of the first mass spectrometer, which he constructed as simply as outlined above. A wide variety of mass spectrometers is in use today, all of which have certain strengths and weaknesses (Elias et al., 2005; Grange et al., 2006; Mann et al., 2001; Pedrioli et al., 2004). It is beyond the scope of this study to assess the differences in the available instrumentation, but computational processing is influenced by a number of these differences, which will be referred to later in the text where they become relevant (Sections I4 and I5). A mass spectrometer consists of a number of abstract building blocks. These building blocks are well defined because it is necessary to a) form ions, if they are not available a priori, b) accelerate these ions in order to establish their mass over charge ratio, and c) detect them.
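As a small worked example of the mass over charge ratio, the sketch below computes m/z for a positively charged ion formed by protonation, as is typical for peptides. The function names and the example mass are hypothetical; the relation m/z = (M + z * m_proton) / z assumes that each charge is carried by one added proton.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mass_to_charge(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz, charge):
    """Invert mass_to_charge: recover the neutral mass from an observed m/z."""
    return charge * (mz - PROTON_MASS)


# The same hypothetical 1000 Da peptide appears at different m/z values
# depending on its charge state.
mz_by_charge = {z: mass_to_charge(1000.0, z) for z in (1, 2, 3)}
```

This is why the charge state of an ion has to be known or inferred before its mass can be derived from a measured m/z value.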

[Figure 2 (block diagram of the mass spectrometer): Inlet -> Ionization -> Ion transfer and focusing -> Ion trap -> Ion detection]

Figure 2: Basic abstract building blocks of a mass spectrometer. Ion flow is depicted by arrows. The two building blocks ion focusing and ion trap are optional, but are parts of the mass spectrometers used in the present study.


The mass spectrometers used in this study are electro-spray ionization (ESI) quadrupole ion-trap instruments. Such an instrument basically consists of four building blocks (Figure 2), each of which is explained in more detail in Sections I3.2.1-I3.2.4. Following the description of these building blocks, the actual process of mass spectrometry is detailed in Section I3.3.

3.2.1 Ionization

In order to measure any compound with any mass spectrometer, the compound has to be charged and in the gas phase so that it can be accelerated in an electromagnetic field and then analyzed. There are many methods available to achieve ionization (Griffiths et al., 2001; Kast et al., 2003; Mann et al., 2001). Within proteomics, electro-spray ionization (ESI) and matrix assisted laser desorption ionization (MALDI) are now commonly used, although there are further methods like fast atom bombardment, electron-impact, and atmospheric pressure chemical ionization. Although MALDI is the most efficient way to ionize peptides, ESI was used in this study because it is more broadly applicable to ionizing and vaporizing polar biomolecules (Griffiths et al., 2001) and allows an online combination with LC separation techniques. ESI was introduced by John Fenn in the 1980s as a method for the mass spectrometric analysis of heavy, polar, biologically important molecules (Fenn et al., 1989). This work later won him a share of the 2002 Nobel Prize in Chemistry.

[Figure 3 (schematic of the electro-spray source): labels Needle, Anode, Cathode; positive charges in the spray are drawn as +]

Figure 3: Charged peptides, in solution, eluting from a reverse phase column, are sprayed from a needle with a fine opening and are further accelerated by the application of an electromagnetic field. Some parts of the spray-mist are enlarged.


In MALDI, which is credited equally to Karas, Hillenkamp, and Tanaka (the latter won a share of the 2002 Nobel Prize in Chemistry), the sample is co-crystallized with a light-absorbing matrix. By firing laser pulses at the crystals thus formed, protonated and deprotonated sample molecules as well as matrix molecules become vaporized. The charged molecules in the gas phase can then be accelerated in an electric field and are usually analyzed by time of flight (TOF). The ESI process is depicted in Figure 3. The dissolved analyte, which is partly charged in solution, is channeled to the needle, from which it sprays, forming a fine mist, into the mass spectrometer. Charged droplets are accelerated towards the opening in the counter electrode by an electromagnetic field and by the pressure difference between this gas-filled chamber and the vacuum within the mass spectrometer, sometimes aided by back-pressure. As a droplet travels towards the cone, solvent evaporates and the droplet shrinks until the surface tension can no longer sustain the charge, at which point a "Coulombic explosion" occurs and the droplet is ripped apart. This produces a) smaller droplets that can repeat the process, as well as b) naked charged analyte molecules. These analyte molecules can be singly or multiply charged. The needle may be placed at an offset relative to the aperture in order to allow less solvent to enter the next section of the mass spectrometer. A gas flow countering the flow of the ion-beam can be used to enhance the process even more. Neither of these enhancements was available in the MS instruments used in the present study. The ion-beam thus formed enters the focusing section of the mass spectrometer through the aperture in the counter electrode. The quadrupole, which performs this focusing, will be explained in the next section.
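The point at which surface tension can no longer hold a charged droplet together is commonly described by the Rayleigh limit, q = 8 * pi * sqrt(epsilon_0 * gamma * r^3). The source does not name this formula, so the sketch below is an assumption added for illustration; the function names, the default surface tension (roughly that of water), and the example radius are hypothetical.

```python
import math

EPSILON_0 = 8.8541878128e-12        # F/m, vacuum permittivity
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def rayleigh_limit_charge(radius_m, surface_tension=0.072):
    """Maximum charge (C) a droplet of the given radius can carry.

    surface_tension is in N/m; 0.072 N/m approximates water at room
    temperature and is an assumed value, not one taken from the text.
    """
    return 8 * math.pi * math.sqrt(EPSILON_0 * surface_tension * radius_m ** 3)


def rayleigh_limit_elementary_charges(radius_m, surface_tension=0.072):
    """The same limit expressed as a number of elementary charges."""
    return rayleigh_limit_charge(radius_m, surface_tension) / ELEMENTARY_CHARGE


# A droplet with a radius of one micrometer (hypothetical example).
charges_1um = rayleigh_limit_elementary_charges(1e-6)
```

Because the limit scales with r^1.5 while the charge on an evaporating droplet stays roughly constant, shrinking droplets eventually exceed the limit and break apart.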

3.2.2

Ion Transfer and Focusing

The ion-beam arriving in the mass-analyzer, which is used for ion-focusing, ion-transfer, and mass exclusion, consists of various peptide species with varying mass over charge ratios (see Figure 7). A quadrupole mass-analyzer consists of four parallel rods (Figure 4) that have fixed direct current (DC) and alternating radio frequency potentials applied to them. Ions produced in the source of the instrument are focused and passed along the middle of the quadrupoles. Their motion will depend on the electric fields so that only ions of a particular m/z ratio will be in resonance and thus pass through to reach the detector. The radio frequency is varied to bring ions of different m/z ratios into focus on the detector and thus build up a mass spectrum. This method was developed by Wolfgang Paul, who later received a share of the 1989 Nobel Prize for Physics, for the development of the first ion trap which is based on this method (Section I3.2.3).

Figure 4: Mass-analyzer. Two trajectories, one of a resonant (red) and one of a non resonant ion (blue) are shown.

3.2.3

Ion-Trap

Ion-traps or quadrupole ion-traps, which are basically derivatives of mass-analyzers, enable trapping of ions. Several types of ion-traps are available and are still being developed further (e.g. Yoshinari, 2000). The ions that pass through the mass-analyzer can be stored in the ion-trap. The ion-trap also allows ion selection, ion trapping, and, more importantly for peptide and protein identification, ion collision (Griffiths et al., 2001; Hsu et al., 2005; Zhong and Li, 2005). Collision is usually performed between selected ions and stable gas molecules such as helium, argon, or nitrogen. The collision between these two molecules leads to breakage of bonds in the analyte molecule. The fragment ions can then be detected by the mass-detector. Thus a mass spectrum can be generated from a number of ions, which potentially fragment differently. This process is called collision induced dissociation (CID). The spectrum of the fragment ions of a so-called parent or precursor ion is called a tandem MS, MS/MS, or MS2 spectrum.

Figure 5: Schematic representation of a cyclic ion-trap (Yoshinari, 2000).

The generation of tandem MS spectra is vitally important in peptide and protein identification (see Sections I4 and I5) by database searches (Eng et al., 1994; Perkins et al., 1999) and in de novo amino acid sequencing (Baginsky and Gruissem, 2004; Dancik et al., 1999). Collision induced dissociation and other fragmentation methods are further explained in Section I3.3.1. The final step in mass spectrometry is the detection of the ions which is briefly introduced in the following section.

3.2.4

Ion Detection

After ionization, ion-focusing, and ion-trapping, the ions need to be visualized. In the early days of mass spectrometry this was achieved by use of a photographic plate. Nowadays, more sophisticated detection methods are available. Much like photomultipliers, Channeltrons, electron multiplier tubes, and microchannel plates send out an avalanche of electrons after being struck by one. A Faraday cup, unlike the other methods, measures the ion-beam current and can only be used in analog mode, which renders it less sensitive. The instrument used in the present study first converts and focuses the ions with a concave conversion dynode placed off axis at a right angle to the ion-beam. The converted and focused ion-beam is then accelerated again towards the electron multiplier. The ion-beam strikes the inner wall of the electron multiplier cathode, creating a cascade of further electrons emitted by further collisions with inner surfaces of cathodes, which in the end leads to a measurable signal. One aspect to keep in mind here is that there is a bias towards heavier masses, which produce more pronounced signals. Now that all building blocks of a typical mass spectrometer have been described, the actual process of mass spectrometry will be the focus of the next section.

3.3

Mass Spectrometry

If all peptides from a protein digestion of a complex mixture were subjected to mass spectrometry at the same time, the result would be an overloaded spectrum that is impossible to analyze because it is comprised of a huge number of peaks. Therefore, a separation step is introduced prior to mass spectrometry (Figure 6). The peptides are channeled through a reverse phase (RP) column to the ESI source of the mass spectrometer. The RP column separates the peptides by their hydrophobicity. With increasing percentage of organic solvent in the liquid phase the peptides can be eluted sequentially. Despite this separation step, many different peptide species arrive at the ionization source at the same time, which, when measured, results in a mass spectrum comprised of multiple distinct mass over charge ratios, as shown in Figure 7.

Figure 6: Liquid chromatography coupled to an ion trap mass spectrometer (Peng and Gygi, 2001).

Peptides that are submitted to mass spectrometry may be analyzed with respect to three properties: their charge, their molecular mass, and their amino acid sequence. All of this information cannot be obtained from a single cycle of MS with the setup employed. Therefore, three consecutive cycles of MS are automatically performed on the peptides arriving at the ionization source of the mass spectrometer. Each cycle of MS also needs to be performed on newly ionized peptides. First the peptides are separated by nano LC, and then they are ionized and sprayed into the mass spectrometer. With RP-LC connected upstream, distinct time windows become available in which the eluate arriving at the ionization source is comprised of a more or less well defined peptide composition, depending on the flow rate, the prior separation, and other parameters.

In the first MS cycle the mass over charge distribution of the peptides available in the current fraction coming from the reverse phase column is resolved. Figure 7 shows a so-called full scan, comprised of all peptide ion-types at a defined time of chromatography. The peptide with the highest intensity, here at a mass over charge ratio (m/z) of about 626 Dalton (Da), will be automatically selected for further analysis. Two cycles of MS follow this full scan. For the second MS cycle, the most abundant ion as measured in the full scan is selected first. Then the quadrupole is used to exclude all masses that are significantly smaller or greater, so that the mass distribution of peptide species in the ion-trap is very homogeneous. The ion-trap itself can also be used to deliberately stabilize and destabilize the ions contained therein to achieve the same, or to enhance the result.

Figure 7: Mass spectrum of a full scan (relative intensity versus m/z, m/z range 400 to 2000). The most abundant ions are labeled by their rounded corresponding mass over charge ratio (m/z). Spectrum recorded by Christine Markert; annotated by author.

This second MS cycle is called a zoom scan (Figure 8). The zoom scan analyzes the area around the selected peptide at a higher resolution, thus also resolving isotope peaks. Isotope peaks occur because isotopes of the atoms that make up biological molecules are incorporated with a certain chance. The 13C isotope of carbon plays the most important role in this context. Isotopes differ from the normal atom by usually one, but also multiple, atomic mass units. The chance of incorporating multiple isotopes decreases with the number of isotopes, as can also be deduced from the falling intensities of subsequent isotopic peaks in Figure 8. Here a window of 10 Da around the most abundant ion is zoomed. The three well defined isotopic peaks are labeled in the figure. The charge can now be calculated by inverting the m/z difference between two consecutive peaks (C). In this example ∆m/z is 0.5 Da for all differences labeled “C” in Figure 8.

Figure 8: Zoom scan (relative intensity versus m/z, window 622.30 to 632.30) of the most abundant ion from Figure 7. The well resolved isotope peaks are labeled (peaks at m/z 626.1, 626.6, 627.1, and 627.7). The mass difference in-between two matching peaks is indicated by double arrows. The label “C” indicates that the charge can be calculated by the difference in mass at this locus. Spectrum by Christine Markert; modified and annotated by the author.

The inverse of 0.5 is 2, which means that the charge (z) of this peptide is 2. Knowing the charge renders calculating the mass trivial. The m/z ratio is 626.1 Da, which when multiplied by 2 gives the mass of the ion. The mass of the doubly charged ion is therefore 1252.2 Da. Two hydrogen atom masses may now be subtracted from the mass of the ion to determine the uncharged mass of the peptide (1250.2 Da).

The overall process of full scan, zoom scan, and tandem MS can be repeated while shifting the focus between several highly abundant peptides during a time interval. In this study, the most abundant ion was selected first and measured. In the following one minute interval, which is the time window in which RP-LC guarantees a homogeneous distribution of peptide species, as many of the remaining highly abundant ions as possible were recorded, while the already processed m/z ratios were blocked. It was thus possible to analyze as many as 20 distinct peptides eluting from the reverse phase column in a one minute cycle.

Two of the three properties of the peptide ion, charge and mass, were easy to determine as shown above. The first MS cycle aided in the selection of a peptide ion, whose mass and charge were determined in the second MS cycle. The third MS cycle is used to determine the third property, the amino acid sequence. This process is somewhat more involved. A whole category of software tools tries to determine the amino acid sequence from an MS/MS spectrum (see Section I5). Other algorithms try to correlate the peak list forming the MS/MS spectrum with entries in sequence databases (see Section I4). A recent summary of a variety of these tools can be found in Kapp et al., 2005 and Shadforth et al., 2005a. The process of generating MS/MS spectra is quite involved and is therefore dealt with separately in the following section.
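The charge and mass arithmetic described above can be written down in a few lines. The following is only a minimal sketch: the function names are hypothetical, the peak values are the ones read from Figure 8, and the charge carrier is assumed to be a proton of about 1.00728 Da rather than a full hydrogen atom.

```python
PROTON_MASS = 1.00728  # Da; assumed mass of one charge carrier (proton)

def charge_from_isotope_spacing(mz_peaks):
    """Estimate the charge from consecutive isotope peaks of a zoom scan."""
    spacings = [b - a for a, b in zip(mz_peaks, mz_peaks[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Neighbouring isotope peaks differ by about 1 Da in mass, i.e. 1/z in m/z.
    return round(1.0 / mean_spacing)

def neutral_mass(mz, charge):
    """Uncharged mass: m/z times charge, minus one proton per charge."""
    return mz * charge - charge * PROTON_MASS

zoom_scan_peaks = [626.1, 626.6, 627.1]  # m/z values taken from Figure 8
z = charge_from_isotope_spacing(zoom_scan_peaks)
ion_mass = zoom_scan_peaks[0] * z
print(z, round(ion_mass, 1), round(neutral_mass(zoom_scan_peaks[0], z), 1))
```

Averaging the spacings over several isotope peaks makes the estimate slightly more robust against the rounding of individual peak positions than using a single pair.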

3.3.1

MS/MS

In an MS/MS experiment, ions are fragmented and the fragmentation spectrum is recorded. Two mass analyzers are necessary for this approach: one to select the ion for fragmentation and the next to analyze the resulting fragments. These mass analyzers can be separated in space, as realized in triple quadrupole and hybrid instruments, or in time, as in trapping instruments (Griffiths et al., 2001). The actual fragmentation process can be manifold. Peptides can be fragmented by electron capture dissociation (ECD), by post source decay (PSD), and by collision induced dissociation (CID), to name only a few (Iavarone et al., 2004; Keough et al., 2000). Since the differences in the composition of the resulting spectra are significant, only the method used in this study, low energy CID, will be discussed. Collision induced dissociation is achieved by forcing the ions to collide with gas molecules (here helium) in order to achieve fragmentation (Hsu et al., 2005). Peptide ions which collide with helium in a CID experiment can and will break at every possible bond of their chemical structure. The abundance of the possible peptide fragments (labeled in Figure 9) depends on the energy that was used for the collision. In this study low energy CID (Clauser et al., 1999; Fernandez-de-Cossio et al., 2000; Shadforth et al., 2005a) was performed, leading primarily to b, y, and a-ions (Figure 9), but peptides also fragment multiple times, creating internal fragments, due to the low energy and the therefore higher chance to collide multiple times. This is not as prominent in high energy CID (Fernandez-de-Cossio et al., 1998; Shadforth et al., 2005c). An annotated spectrum with labeled b and y-ions is presented as Figure 13.

Figure 9: This figure shows the backbone of an amino acid sequence. The side chains of the amino acids are labeled R1 through R4. Breakages within the backbone create the corresponding ions a-x, b-y, and c-z, respectively. Two breaks in the sequence may create internal fragment ions (ifi). Side chain losses (ssl) are also a common occurrence. The N-terminal ion fragments are called a, b, and c type ions while their corresponding C-terminal ions are named x, y, and z ions. All ions can further be modified by loss of one or multiple chemical groups (e.g. H2O, NH3).
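To make the ion nomenclature concrete, the singly charged b-, a-, and y-ion m/z values of a peptide can be computed from its residue masses. The sketch below is illustrative only: the function name is hypothetical, the table holds standard monoisotopic residue masses, and neutral losses, internal fragments, and higher charge states are ignored.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # Da, H2O
PROTON = 1.00728   # Da, charge carrier
CO = 27.99491      # Da, lost when a b-ion becomes an a-ion

def fragment_ions(peptide):
    """Return singly charged b-, a-, and y-ion m/z lists (index 0 is ion 1)."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    ions = {"b": [], "a": [], "y": []}
    for i in range(1, n):
        prefix = sum(masses[:i])      # N-terminal residues 1..i
        suffix = sum(masses[n - i:])  # C-terminal residues, i of them
        ions["b"].append(prefix + PROTON)
        ions["a"].append(prefix + PROTON - CO)
        ions["y"].append(suffix + WATER + PROTON)
    return ions

ions = fragment_ions("SLSPAQVSFAVEAVGK")  # peptide shown in Figure 13
print(round(ions["b"][0], 3), round(ions["y"][0], 3))
```

A useful sanity check on such a table is that complementary ions add up: b_i plus y_(n-i) equals the uncharged peptide mass plus two proton masses.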

The mass over charge ratios of the fragment ions resulting from the CID experiment are measured after fragmentation. A typical fragmentation spectrum, also called tandem MS, MS2, or MS/MS spectrum, is presented in Figure 10. The uncharged mass of the peptide which led to the fragmentation spectrum in Figure 10 is 1588.6 Da. The charge was determined to be 2. As can be seen from this figure, the MS/MS spectrum is not very well resolved in the lower and the higher mass regions. In fact, the first measured peak is at around 250 Da, so that about two fragment ions are missing in that region. The region above 1400 Da is also quite bare of peaks, which leads to at least one missing ion fragment in that region. Missing information in these regions of mass spectra is a common problem. Random noise (Frank and Pevzner, 2005) and chemical noise (O'Connor P et al., 2006) add even more problems to the later processing of the spectra (see Sections I4 and I5). Measurement of fragmentation spectra of fragmentation spectra (MS3) promises to overcome many problems of MS2, since it adds confidence to each further fragmented fragment of the MS/MS experiment (Olsen and Mann, 2004).

Figure 10: MS/MS spectrum of a peptide in an m/z window from 200 to 1600. The spectrum was measured by Bianca Naumann. It was prepared for display by the author. The most intensive peaks per region are labeled with their rounded m/z ratios, if they were not too close to a more intense peak from a following or previous.

The MS/MS spectra obtained through mass spectrometry, along with their charge and mass, can be used to determine the amino acid sequence of the peptide that was originally fragmented. In turn, the peptide may identify the protein it was derived from. There are roughly two approaches. One correlates parts of the spectrum, or the full mass list, to a database containing sequence information (Section I4) and essentially looks up the answer in a “book” (Sadygov et al., 2004). The other tries to determine the sequence without the help of a database and attempts de novo amino acid sequencing (Section I5).

4

Database Searching

Mass spectrometry of peptides was used to identify proteins before fragmentation spectra were used for the same purpose. The protein was cleaved into peptides by a protease, usually trypsin, and the masses of the resulting peptides were measured. The idea is that each protein consists of a distinct set of peptides which can be mapped against an in silico digest of a sequence database. The approach, called peptide mass fingerprinting (PMF) (Giddings et al., 2003; Mann et al., 1993), is obviously vulnerable to the complexity of the mixture. As the number of peptide species in the mixture increases, the possible combinations increase exponentially, thus making a correct protein assignment difficult at the given accuracy of the MS/MS data (see I6.1). Furthermore, peptide masses may not be unique in sequence databases at the resolution of a given mass spectrometer. This can be seen in Publication 4, where the mass distribution of the peptides from the digest of the six-frame translation of the C. reinhardtii genome is reported. It is obvious that with increasing peptide mass the number of peptides in the same mass range decreases, thus rendering higher mass peptides more discriminating in PMF. Furthermore, other problems, for example on the mass spectrometric level, may hinder ionization or detection of peptides (Mann and Pandey, 2001). Lastly, proteins may be rather short and might not yield a sufficient number of detectable peptides. Because of these inherent problems, PMF does not play an important role in protein identification from mass spectrometric data in higher eukaryotes with complex gene structures. The introduction of MS/MS spectra for peptide and protein identification made it feasible to amend the peptide mass with additional information in order to make the peptide identification, and in turn the protein identification, more reliable.
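The principle of peptide mass fingerprinting can be illustrated with a small sketch. It is not the procedure of any particular tool: the function names and the two-entry toy database are hypothetical, trypsin is assumed to cleave after K or R unless the next residue is P, and missed cleavages and modifications are ignored.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # Da, added once per peptide

def tryptic_digest(protein):
    """Cleave after K or R unless followed by P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Uncharged monoisotopic peptide mass."""
    return sum(MONO[aa] for aa in peptide) + WATER

def pmf_scores(measured, database, tol=0.5):
    """Count, per protein, how many measured masses match a digest peptide."""
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(m - t) <= tol for t in theoretical) for m in measured
        )
    return scores

database = {  # hypothetical toy entries
    "protA": "SLSPAQVSFAVEAVGKWLQYSEVIHAR",
    "protB": "GASPVTCLNDQKEMHFR",
}
measured = [peptide_mass(p) for p in tryptic_digest(database["protA"])]
print(pmf_scores(measured, database))
```

With real data the tolerance `tol` would have to reflect the mass accuracy of the instrument, and a simple hit count would be replaced by a proper scoring scheme.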
Sequence tags (Mann and Wilm, 1994; Shevchenko et al., 1996) amend the mass of the peptide, as used in PMF, with a short partial amino acid sequence, which is determined from the spectrum (see Section I5), and its position within the peptide. Thus four parameters define a sequence tag: a) the mass of the peptide, b) the partial amino acid sequence, c) the mass before the start of the partial amino acid sequence in the peptide, and d) the remaining mass after the end of the partial amino acid sequence within the peptide. Partial sequences can be searched for in sequence databases, usually presented as plain text files in FASTA format (Pearson and Lipman, 1988). Furthermore, the masses of the resulting fragments of the in silico digest of these files are used to filter the results. Another filter is given by the position of the partial amino acid sequence. These three filters are very restrictive and more discriminating than PMF, as can be seen from the mathematical assessment in Publication 4. Therefore the sequence tag approach is widely used today and new developments are reported regularly (Bafna and Edwards, 2001; Savitski et al., 2005; Sunyaev et al., 2003; Tabb et al., 2003).

Another approach, developed around the same time as sequence tagging, makes use of the complete MS/MS spectrum (Eng et al., 1994). The implementation of this method, named Sequest, will be discussed in detail in Section I4.2 since it was used in this study. Sequest and Mascot (Perkins et al., 1999), which employs sequence search, ion search, and PMF, and which introduced a probability-based scoring scheme for the first time (see Section I4.1), are the so-called industry standards for software in this area today (Cannon and Jarman, 2003; Kapp et al., 2005; Shadforth et al., 2005a).

Besides Sequest and Mascot, numerous tools that match mass spectrometric data to sequence databases are available. ProFound (Zhang and Chait, 2000) uses a Bayesian approach to assign significance to the matches. SCOPE (Bafna and Edwards, 2001) and ProbID (Zhang et al., 2002) introduced new probabilistic scoring systems. PepHMM (Wan et al., 2005) uses a hidden Markov model (HMM) for peptide identification. Other programs, such as GutenTag (Tabb et al., 2003) and MultiTag (Sunyaev et al., 2003), use the sequence tag approach. The FASTA approach was also adapted to work with mass spectrometric data (Mackey et al., 2002), and finally artificial neural networks were tried for peptide identification (Baczek et al., 2004).
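The way a sequence tag filters candidate peptides can be sketched as follows. This is a simplified illustration and not the procedure of any of the cited programs: the function names are hypothetical, and the flanking masses are assumed to be plain sums of residue masses, whereas real tags are usually expressed through the m/z values of the flanking fragment ions.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # Da

def residue_sum(sequence):
    """Sum of residue masses, without terminal groups."""
    return sum(MONO[aa] for aa in sequence)

def peptide_mass(peptide):
    """Uncharged monoisotopic peptide mass."""
    return residue_sum(peptide) + WATER

def matches_tag(peptide, total_mass, prefix_mass, tag, suffix_mass, tol=0.5):
    """Apply the filters: peptide mass, partial sequence, and its position."""
    if abs(peptide_mass(peptide) - total_mass) > tol:
        return False
    start = peptide.find(tag)
    while start != -1:
        before = residue_sum(peptide[:start])
        after = residue_sum(peptide[start + len(tag):])
        if abs(before - prefix_mass) <= tol and abs(after - suffix_mass) <= tol:
            return True
        start = peptide.find(tag, start + 1)
    return False

# Tag derived from the peptide of Figure 13: SLSPA | QVSF | AVEAVGK
tag = (peptide_mass("SLSPAQVSFAVEAVGK"), residue_sum("SLSPA"),
       "QVSF", residue_sum("AVEAVGK"))
for candidate in ["SLSPAQVSFAVEAVGK", "KGVAEVAFSVQAPSLS", "QVSFSLSPAAVEAVGK"]:
    print(candidate, matches_tag(candidate, *tag))
```

The second and third candidates have the same mass as the first and would pass a pure mass filter; they are rejected only by the partial sequence and by its position, which is what makes the tag more discriminating than PMF.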
This short enumeration of database search programs is far from complete, not only because the number of programs is already enormous but also because it keeps increasing. Some of them have been reviewed, but again the list is far from complete (Chamrad et al., 2004; Kapp et al., 2005; Shadforth et al., 2005a). Mascot and Sequest will be studied in more detail in the following two sections: Sequest because it was used in this study, and Mascot because it is the second industry standard among the variety of currently available database search algorithms that match MS/MS data to sequence databases.

4.1

Mascot

Mascot (Perkins et al., 1999) was developed based on the MOWSE (Molecular Weight Search) scoring scheme (Pappin et al., 1993). MOWSE is essentially a PMF search in a peptide mass database. The peptide database is created by digesting a protein database in silico with a number of enzymes, including incomplete and missed cleavages. The non-redundant protein database OWL was used initially (Bleasby et al., 1994). The PMF search is then performed on this pre-digested database in a mass error tolerant fashion. The MOWSE documentation (http://www.es.embnet.org/Doc/FAQ/MOWSE.html) states that as few as 3-4 distinct peptide masses are sufficient for identification within a set of 50000 proteins, consisting of about 15 million peptides. One problem of the MOWSE algorithm is that peptides were grouped into bins of 100 Da, which is about equal to the average amino acid mass, thus creating possible identification ambiguities. As outlined above, the discrimination capabilities of PMF can be increased by including the information of fragmentation spectra (Veenstra et al., 2004). Mascot, which incorporates and extends the MOWSE system, does this by integrating the information contained in fragmentation spectra into a probabilistic framework, details of which have not been published (Perkins et al., 1999). Mascot also removes the necessity for a pre-digested peptide database and works directly from a database in FASTA format (Pearson and Lipman, 1988) containing the sequence information. The proteins contained in the FASTA file are then digested on the fly. Mass accuracy, peptide mass distribution in the database, MS/MS fragment ion series, and peak intensity values are considered in addition to PMF. For each reported peptide the probability of being a chance event is calculated. Mascot has recently been extended to include post translational modifications (PTM) in its database search (Creasy and Cottrell, 2002). PTMs are searched serially to avoid loss of discrimination. Mascot is available at http://www.matrixscience.com. Sequest, the second industry standard database search tool, which was used in the present study, is discussed in the following section.

4.2

Sequest

Unlike Mascot, the Sequest algorithm is published in its entirety (Eng et al., 1994; Yates et al., 1995); it will be explained in more detail in this section. Knowledge about the algorithm employed will be helpful when limitations are assessed in Section I6. First, Sequest performs some data reduction. A mass window around the precursor ion peak and all but the 200 most abundant peaks are removed from the spectrum. Then Sequest compares the measured mass of the peptide to the calculated masses of a digest of the sequence database. The best 500 matches are selected in this process. For each of these 500 matches a hypothetical spectrum is generated, and the matches are evaluated by their number of matching ions within a given mass tolerance, the abundances of their matching ions, the continuity of their ion series, and the presence of immonium ions. Immonium ions are useful for detecting and confirming many of the amino acid residues in a peptide, although they provide no information regarding the position of these residues in the peptide sequence.

The ranked list so generated is then re-evaluated with a cross correlation analysis. For this, Sequest constructs an artificial spectrum comprised of a selection of user selectable ions (e.g. b, y, and a-ions for low energy CID as in this study). This spectrum may look much like Figure 12. The intensities of the fragments cannot yet be predicted correctly, although there are approaches in that area (Bern et al., 2004; Havilio et al., 2003). Therefore the intensities in Figure 12 are arbitrarily chosen and do not reflect real world experimental results as in Figure 10.

Figure 12: Artificial spectrum (intensity versus m/z) of the peptide WLQYSEVIHAR comprised of a, b, c, x, y, and z ions.

Once the artificial spectrum is constructed it can be correlated to the original tandem MS spectra acquired. This process of cross correlation can be described as overlaying the two spectra and then shifting one of them to the left and to the right, respectively. For each position a correlation for the equality of both spectra at that particular offset is calculated. If the two spectra are similar, they will display a peak in a result graph at the appropriate offset. Cross correlation in this environment is only valid if the offset with the best fit is at zero. The depiction of a cross correlation could look like Figure 11. This figure displays a perfect cross correlation of the peptide from Figure 12 with itself. The algorithm employed to calculate this cross correlation is different from the one used in Sequest; it reaches its maximum of 1 for a perfect fit, whereas zero stands for no apparent similarity. The often discussed Sequest cross correlation factor, given as XCorr, does not behave like this. Its maximum may become infinite. Therefore thresholds for the significance of these findings have to be chosen empirically. They are currently also dependent on the charge of the precursor ion (see II3.2). Interestingly, Mascot has no such limitations.

Figure 11: Cross correlation of the spectrum in Figure 12 with itself (offsets from -200 to 200).

Only those sequences in the database that match the measured precursor mass within an error tolerance are selected for cross correlation analysis. Although computing the cross correlation function by fast Fourier transform is relatively quick, it might take days to correlate a thousand spectra to a hundred MB sequence database if all peptides were considered. Therefore Sequest first selects the 500 best matching peptides according to their mass and a fragment ion count. These 500 sequences are then evaluated by cross correlation and a user specified number of sequences is reported. Mapping spectra to existing sequences in a database may fail or be computationally prohibitive in some cases (see Sections I6 and I7). Therefore, it may be useful to determine the amino acid sequence directly from the data contained in the MS/MS spectrum using a de novo sequencing approach, which will be discussed in the following section.
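The overlay-and-shift picture of cross correlation can be illustrated with a short sketch. It is explicitly not Sequest's XCorr: the code below computes a normalized cross correlation that reaches 1 at offset zero for identical spectra, comparable in spirit to Figure 11. The function names, the unit-width binning, and the peak lists are assumptions made for illustration.

```python
import math

def bin_spectrum(peaks, max_mz=2000):
    """Turn (m/z, intensity) pairs into an intensity vector with 1 m/z bins."""
    vec = [0.0] * (max_mz + 1)
    for mz, intensity in peaks:
        idx = int(round(mz))
        if 0 <= idx <= max_mz:
            vec[idx] += intensity
    return vec

def cross_correlation(x, y, max_shift=200):
    """Normalized correlation of x with y for every offset in the window."""
    norm = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    peaks_x = [(i, v) for i, v in enumerate(x) if v]  # skip empty bins
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for i, v in peaks_x:
            j = i + shift
            if 0 <= j < len(y):
                total += v * y[j]
        scores[shift] = total / norm if norm else 0.0
    return scores

# Hypothetical peak list standing in for an artificial spectrum.
spectrum = bin_spectrum(
    [(175.1, 1.0), (312.2, 0.8), (425.3, 0.6), (588.3, 1.0), (717.4, 0.4)]
)
self_cc = cross_correlation(spectrum, spectrum)
best_shift = max(self_cc, key=self_cc.get)
print(best_shift, round(self_cc[best_shift], 3))
```

Looping over offsets directly keeps the sketch readable; for full-size spectra the same quantity is usually obtained much faster through a fast Fourier transform.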

5

De Novo Sequencing

De novo sequencing algorithms seek to determine the underlying peptide sequence from mass spectrometric information alone. The rationale behind this approach is that, as outlined in Section I3.3.1, peptides dissociate into predictable fragments. Looking at y-ions alone clearly shows that the difference between two consecutive y-ions in a spectrum represents the mass of one or multiple amino acids (Figure 13). Other ion-types may provide additional and supporting information in this scenario. The best case occurs when a complete fragment ion ladder of at least one ion-type is present. Then the underlying amino acid sequence can easily be constructed. Figure 13 can serve as an example.

Figure 13: Tandem spectrum (intensity versus m/z, m/z range 200 to 1600) of the peptide sequence SLSPAQVSFAVEAVGK as predicted by PEAKS. B-ions and y-ions are labeled; the corresponding peaks are indicated by drop-lines from the label to the corresponding peak (b-ions red dots, y-ions blue dots). The sequence is indicated in the spectrum. The upper sequence is centered amidst the y-ions, while the lower sequence is centered amidst the b-ions. The spectrum was measured by Bianca Naumann; annotated by author.
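Reading a sequence from a complete ion ladder, as described above for the y-ions, amounts to looking up the differences between consecutive peaks in a table of residue masses. The following sketch assumes an ideal, noise-free, singly charged y-ion series; the function names and the tolerance are hypothetical choices.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728  # Da

def y_ions(peptide):
    """Ideal singly charged y-ion m/z values y1 .. y(n-1)."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    return [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]

def read_ladder(ion_mzs, tol=0.02):
    """Translate gaps between consecutive ions of one series into residues."""
    ordered = sorted(ion_mzs)
    residues = []
    for low, high in zip(ordered, ordered[1:]):
        diff = high - low
        hits = [aa for aa, mass in MONO.items() if abs(mass - diff) <= tol]
        # Isobaric residues such as L and I cannot be told apart by mass.
        residues.append("/".join(hits) if hits else "?")
    return residues

ladder = read_ladder(y_ions("SLSPAQVSFAVEAVGK"))
# y-ions grow from the C-terminus, so reverse to read N- to C-terminus.
print(" ".join(reversed(ladder)))
```

The first and the last residue are not recovered from the gaps alone; they follow from the lowest y-ion and from the precursor mass. In a measured spectrum, missing peaks produce gaps that correspond to two or more residues, which is one reason why the ladders of b- and y-ions are combined.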

While the N-terminal and the C-terminal regions of the spectrum are not very well resolved, due to limitations of the measurement procedure, most of the b and y-ions are present therein. Therefore, determination of the underlying peptide sequence was completely accurate. This was possible because the two ladders of b- and y-ions complement each other. The C-terminal region was determined by using the b-ion ladder, whereas the exact N-terminus was determined using the y-ion ladder. The inherent problem in de novo sequencing, however, is that it is not known a priori which peak represents which ion-type in a given MS/MS spectrum. Early approaches tried to generate all possible sequences that could account for a given precursor mass (Hamm et al., 1986; Sakurai et al., 1984). This approach, sometimes called exhaustive listing, makes it necessary to correlate each generated sequence to the spectrum. With an increasing number of amino acids and/or rising mass, this quickly becomes computationally prohibitive. A peptide with a mass of 800 Da could give rise to 25,600,000,000 sequence permutations (20^8). Sub-sequencing reduced this problem in that it scores only short amino acid sequences against the spectrum and extends high scoring sequences one amino acid at a time to yield full amino acid sequences in the end (Lu and Chen, 2004; Shadforth et al., 2005a). The approach that is now most widespread is the graph theory approach. The term graph does not refer to an actual display, as in programs that aid manual sequencing with an advanced visualization of the spectrum (Lu and Chen, 2004). The spectrum graph method constructs a node for each peak in the spectrum and connects two nodes if their mass difference can account for one or multiple amino acids within the mass error of the mass spectrometer.
The sequence of the peptide that gave rise to the fragmentation spectrum can now be determined by traversing the spectrum graph. All sequences determined in this step have to be scored against the spectrum in order to determine the one that best explains the fragmentation spectrum. There are many possible ways to traverse the spectrum graph. The shortest path can be computed (Fernandez-de-Cossio et al., 1995). Conversely, the longest path can be computed (Chen et al., 2001; Dancik et al., 1999). The sequence can also be determined by directly trying to traverse from the N- to the C-terminus (Taylor and Johnson, 1997). Scoring of the sequences against the spectrum is an open problem and new methods are described regularly. Examples that use dynamic programming for scoring are, among others, PEAKS (Ma et al., 2003), Sherenga (Dancik et al., 1999), and an approach by Chen et al., 2001. Genetic algorithms (Stranz and Martin III, 1999), hidden Markov models (Fischer et al., 2005), and Bayesian approaches (Frank and Pevzner, 2005) are also employed. Finally, Lutefisk (Taylor and Johnson, 2001) performs rescoring of the peptides and uses, among other scores, a cross correlation similar to Sequest.

The present study first evaluated Lutefisk, PEAKS, and DeNovoX in order to determine which of these programs to integrate into the high throughput proteomics pipeline AutoMS (Appendix B). This assessment is shown in the Materials and Methods section (Section II4). The outcome was that PEAKS was selected due to its higher accuracy in comparison to DeNovoX (Thermo Finnigan, San Jose, Ca, USA) and its significantly higher throughput when compared to Lutefisk. The next section will therefore introduce PEAKS in some more detail.

5.1

PEAKS

In this study PEAKS (Ma et al., 2003; Ma et al., 2005) was used for most processing. In the first publication the authors state that they employ exhaustive listing and use a dynamic programming approach to achieve efficiency. Furthermore, they declare that they do not employ a spectrum graph approach. They devised an algorithm consisting of four distinct steps. First, they preprocess the raw spectrum by noise filtering, peak centering, and deconvolution of higher charged ion species to singly charged ions. In the second step they compute the best 10000 sequences that can explain the precursor ion mass. Most typical fragment ions are considered in this analysis. Peak presence, peak intensity, and mass error are evaluated for both b- and y-ion series. The third step rescores the 10000 candidates. The rescoring tightens the allowed mass error and considers more ion types than the preliminary score, such as internal and immonium ions. In the last step a confidence score is computed for the best scoring sequences from the third step. The number of reported sequences is variable and was set to five in this study. At least one version of PEAKS includes a database matching functionality which uses the same algorithm for database matching as it employs in de novo sequencing.

Both approaches, database searching and de novo sequencing, have certain limitations and advantages. These problems will be explored in Section I6.

6

Limitations

Both database searching with mass spectrometric data and de novo sequencing have limitations. At least one parameter, spectral quality, influences both methods (Section I6.1), but because of their more constrained search space, database search methods are less affected by spectral quality than de novo sequencing algorithms (Shadforth et al., 2005a). This advantage of database search methods is also their greatest drawback (Section I6.3). Prediction and identification accuracy and the resulting confidence are issues in all algorithms introduced so far (Section I6.2). Searching databases in an error-tolerant fashion using de novo predictions as search strings, which is done in the present study, has been realized in MS-Blast (Shevchenko et al., 2001). This approach adapts the BLAST algorithm such that short amino acid sequences, originating from de novo amino acid sequence prediction, can be searched in a sequence database while incorporating some known factors specific to mass spectrometry. These include, among others, the presence of tryptic cleavage sites and the equivalence of leucine and isoleucine. This approach has been advanced to be useful for organisms with unsequenced genomes by using homology with closely related species (Liska and Shevchenko, 2003). It was further improved by no longer requiring de novo predictions, working instead from an alignment of multiple sequence tags (Sunyaev et al., 2003). This robust approach has later been shown to work not only with protein databases but also with EST databases (Liska et al., 2005). While this strategy seems most potent, it still depends on sequence availability in databases (see Section I6.3).

6.1

Quality of MS/MS Spectra

De novo sequencing as well as the scoring methods in database searching rely on the discernible presence of fragment ions in the MS/MS spectrum. Therefore, parameters like accurate mass measurement are of great importance (Clauser et al., 1999). Furthermore, de novo algorithms perform better in the presence of complete fragment ion ladders within the spectrum (Bafna and Edwards, 2003; Mann et al., 2001).

The spectral quality can be assessed by a number of methods (Bern et al., 2004; Purvine et al., 2004), but only a very recent algorithm focuses on fragment ion series (Yan et al., 2005). Correct scoring of sequences to spectra depends on the search space and the spectral quality. The search space in database searching is limited to the sequences in the database, which is why the influence of spectral quality is lower than in de novo sequencing. The search space in de novo sequencing is variable and is theoretically given by all sequence permutations that can explain a given precursor mass. This quickly outgrows any sequence database in use today, as mentioned in Section I5. As can be seen from Figure 10 and Figure 13, the fragment ions are accompanied by noise, which hinders the correct detection of the fragment ions produced by the fragmentation process. This poses more problems for de novo sequencing than for database searches: because the candidate sequences in database look-ups are already known, noise is more tolerable. Methods for noise and spectra filtering have been devised (Alves and Yu, 2005; Bern et al., 2004; Gentzel et al., 2003), but so far they have been incorporated into neither de novo sequencing nor database search algorithms.
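To give a feeling for how quickly the de novo search space grows, the following minimal sketch counts the ordered amino acid sequences whose residue masses add up to a given value. It is an illustration only and not part of any program used in this study. It assumes nominal (integer) residue masses and ignores modifications, the terminal water, and the mass tolerance; the names `RESIDUE_MASS` and `count_sequences` are hypothetical.

```python
# Nominal (integer) residue masses in Da. This is an assumption made for
# illustration; Leu/Ile and Gln/Lys share the same nominal mass.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def count_sequences(total_residue_mass: int) -> int:
    """Count ordered residue sequences whose nominal masses sum to the target.

    Dynamic programming over the mass: the number of sequences for a mass m
    is the sum, over all residues r, of the number of sequences for m - r.
    """
    counts = [0] * (total_residue_mass + 1)
    counts[0] = 1  # the empty sequence explains a mass of zero
    for mass in range(1, total_residue_mass + 1):
        counts[mass] = sum(
            counts[mass - m] for m in RESIDUE_MASS.values() if m <= mass
        )
    return counts[total_residue_mass]
```

For a residue mass of 128 Da the function counts four sequences (Q, K, GA, and AG), and the count rises steeply for masses in the range of typical tryptic peptides.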

6.2

Prediction and Identification Accuracy

The quality of de novo predictions has not yet been studied in detail. This is the first study that employs a de novo algorithm in a large-scale experiment. This study is, however, limited to the low-energy CID spectra from one ESI quadrupole ion trap mass spectrometer. De novo sequencing algorithms predict only a fraction of the sequences identified by database searching algorithms (see Section II4). No assessment was made whether the identifications were true positives. The prediction accuracy of database search programs has been studied. Actual values of true and false positive as well as negative rates depend on the study and the method employed (Kapp et al., 2005; Resing and Ahn, 2005). A number of algorithms have been devised to enhance confidence in peptide identifications (Colinge et al., 2003; Eddes et al., 2002; Razumovskaya et al., 2004). Peptide identification accuracy depends on many factors such as the fragmentation method employed, the resolution of the mass spectrometer, and its signal-to-noise ratio. Therefore these values are machine and method dependent and are thus difficult to assess. Protein identification, when based on multiple peptides, is however undisputed (Veenstra et al., 2004).

6.3

Sequence Availability in Databases

Proteins and peptides that are not in the database searched cannot be identified by the database search approach. There are a number of reasons why sequences may be absent from databases. The region has not been sequenced or there were errors while sequencing. Furthermore there are reasons why sequences seem absent from a database even if they are present. There may, for example, be problems to identify sequences from mass spectrometric data when they contain PTMs which shift the precursor mass and therefore render the peptide unidentifiable by database search algorithms. This problem is known and one workaround is to create several databases with each containing a different PTM for all peptides in the database. This increases not only runtime, but increases rates of false positives, too. Furthermore, with the amount of possible PTMs this analysis can become computationally inhibitive (Duncan et al., 2005). Interestingly, this is not the case for de novo sequencing algorithms which could, if implemented, account for even unexpected modifications (Hansen et al., 2005; Lu and Chen, 2003).

[Figure 14 labels: intron; exon; Left Peptide Fragment: ...(K)GSGDAAYP; Right Peptide Fragment: GGPFFNLFNLGK...; Intron: GTGAGCAGATAGAGGAGAGAGCGCGCGAGAGAGGGCCGCGG CTGCAGCTGGTGTGGGCCAAGGCTGGCGGAGACAGCCAGCA GGGGTCAGGGGGAGGGGCACAGGGCAGAGGCAGACGCCGGC GCTGTCGGGTGCGGGATGCGGGTTGCTGATGGATGGAAACT CCGATCCGCGGCGGACGGTTTCGTGCTATGTAGCTGCTCAA CAGGGTTTGCATGCTCCTGGCTGACAGCGCATATGACGGTC CCTTCCCCCGCTCCGCAG]

Figure 14: Depiction of the peptide GSGDAAYP-GGFFNLFNLGK as found by GPF. The complete gene structure is shown on top. The two fragments of the exons that defined the intron are connected via the arrows to the gene structure. The position of the intron-split is indicated by the lightning bolt. The intron is shown in the lower box.

Peptides that are present in the six-frame translation of a genomic database may seem absent to database search algorithms if they are split by an intron when deduced from genomic DNA (Figure 14). This is even more prominent if there are alternatively spliced isoforms of the protein, which would lead to several possible peptides and proteins. This may be especially striking for low-abundance peptides, which might not be found in EST or other proteomic data. It has been estimated that 40-60% of all human genes are alternatively spliced (Modrek and Lee, 2002). This adds to the finding that each gene in higher eukaryotes contains, on average, four introns (Deutsch and Long, 1999). This again is in line with the assumption that about 25% of peptides from higher eukaryotes measured by mass spectrometry might escape detection because they are split by an intron on the “genomic level” (Choudhary et al., 2001). Introns can be introduced to databases if they are known, e.g. found as EST sequences, but from the EST and genomic data obtained for model organisms like human, mouse, or Arabidopsis, it is obvious that roughly half of the gene products that are encoded by a genome cannot be annotated in this fashion (Okazaki et al., 2002; Reboul et al., 2003; Seki et al., 2002). Introns can also be predicted from the genomic sequence as outlined in Section I2. However, the predictions of these approaches can only be taken as a rough approximation of the real gene structure. Given these limitations, the specific aims of the present study are presented in the following section.
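The effect of an intron on a six-frame translation can be illustrated with a small sketch. It is not the code used in this study. The two peptide fragments are taken from the labels of Figure 14, but the DNA is an artificial back-translation with arbitrarily chosen codons and a made-up 13 nucleotide intron; all function and variable names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> list:
    """Three reading frames on each strand."""
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement(dna))
        for offset in range(3)
    ]


def back_translate(peptide: str) -> str:
    """Pick one (arbitrary) codon per amino acid to build toy DNA."""
    first_codon = {}
    for codon, aa in CODON_TABLE.items():
        first_codon.setdefault(aa, codon)
    return "".join(first_codon[aa] for aa in peptide)


def found_in_six_frames(peptide: str, dna: str) -> bool:
    return any(peptide in frame for frame in six_frame_translation(dna))


# Toy gene: two exons separated by an invented intron (not the real one).
exon1 = back_translate("KGSGDAAYP")
exon2 = back_translate("GGPFFNLFNLGK")
intron = "GTAAGTCTGACAG"
genomic = exon1 + intron + exon2   # what a genomic database contains
spliced = exon1 + exon2            # what is actually translated
peptide = "GSGDAAYPGGPFFNLFNLGK"   # tryptic peptide spanning the junction
```

In this toy case `found_in_six_frames(peptide, spliced)` is true while `found_in_six_frames(peptide, genomic)` is false, although both fragments on their own are present in the six-frame translation of the genomic sequence.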

7

Specific Aims

Given the power of de novo amino acid sequencing from mass spectrometric data to account even for unexpected post-translational modifications, it would be valuable to make efficient use of these predictions. Given the power of the database search approach to detect sequences even from lower quality MS2 spectra, it would be useful to establish a platform that combines the strengths of both approaches.

We aimed to develop an algorithm that would enable the detection of peptides split by an intron on the “genomic level” in order to back up annotation with experimental data and to make detection of alternative splicing feasible. The algorithm was first established in the patent application (Publication 1). It went through several developmental stages and later helped to identify putative alternative splicing (Publication 4).

We also aimed to establish an automated pipeline that would bundle the three algorithms: first de novo amino acid sequencing, followed by our new algorithm, and, alongside and subsequent to this computation, a database search approach (Appendix B).

Additionally, we aimed to store the relevant data in a relational database to assure flexible protein identification from mass spectrometric experiments in an automated fashion. This goal was achieved with the design of a Microsoft Access database (Publication 4; Appendix B).

II Materials and Methods

This study is limited to the computational processing of mass spectrometric data. The experimental procedures to gain this data were not performed by the author. As a thorough understanding of these procedures may be helpful to the understanding of this study, please refer to the individual publications and the references cited therein for more details. First, an overview of the experimental set-up with focus on the computational processing is presented in Section II1. Following this, the experimental methods are briefly introduced (Section II2). Then the computational processing is described in Section II3. Following the description of the processing, the rationale for choosing a particular de novo sequencing algorithm in this study is outlined in Section II4. Section II5 deals with the currently available implementations of the GenomicPeptideFinder, which uses the results from de novo prediction algorithms as input. Section II6 gives a crude assessment of the probability of a chance event, given a peptide mass. This probability is used in protein identification (Section II7), especially in identification of proteins by a single supporting peptide.

1

Experimental Design

The experimental design including the computational data processing is schematically shown in Figure 15. The simplified experimental procedure is shown on top with the result being a raw file, which represents all mass spectra acquired from one sample e.g. one 1D SDS PAGE band. The experimental setup will be detailed in the next section. Methods involving computational processing will be presented in Section II3.

[Figure 15 diagram labels: LC MS/MS; Raw File; AutoMS box: Dta-File Setup, PEAKS, Query Creation, Filters, GPF, Sequest, Result Extraction; MS Access box: Database]

Figure 15: The experimental setup including mass spectrometry (on top), the computational processing (in the AutoMS box), and the data storage and visualization (MS Access box) are schematically presented. Arrows indicate directional flow of data and the sequence of processing steps. Dashed lines indicate possibilities for data filtering.

2

Experimental Procedures

In the first two publications, data from 2D SDS PAGE experiments performed by Christine Markert was used. In the third publication, data from 1D SDS PAGE performed by Bianca Naumann was used. The procedures are briefly outlined in the following sections. Both methods produce a raw-file which contains all spectra recorded during one mass spectrometric experiment. This means that each raw-file contains the spectra of one spot or one band, respectively.

2.1

1D SDS PAGE

A cell-wall-less Chlamydomonas reinhardtii strain was cultivated in iron sufficient and iron deficient media, in the presence or absence of isotopically labeled arginine (six 13C isotopes replaced the 12C carbons in the molecule). The crude cell extract was centrifuged in a sucrose gradient. Two fractions, enriched in thylakoids, were pooled and then separated on a 1D SDS PAGE. Bands were cut from the gel and then digested with trypsin. The digest was channeled to the mass spectrometer by nano RPLC (Ultimate system, LC-Packings). Mass spectrometry was performed on an LCQ Deca XP plus instrument from Thermo Finnigan, San Jose, CA, USA. A more detailed description is available in (Naumann et al., 2005) and references therein.

2.2

2D SDS PAGE

A cell-wall-less Chlamydomonas reinhardtii strain was cultivated in iron sufficient and iron deficient media. The cells were extracted and the extract was centrifuged in a sucrose gradient. Two fractions, enriched in thylakoids, were pooled and separated by 2D SDS PAGE. Spots were excised from the gel, digested with trypsin and then submitted to mass spectrometry via nano RP-LC (LC-Packings, Amsterdam, The Netherlands). The mass spectrometer used was an LCQ Deca XP ion trap instrument from Thermo Finnigan, San Jose, CA, USA. The procedure is described in (Stauber et al., 2003) and references therein.

3

Computational Processing

After extraction of the mass spectra from the raw-file, the computational processing splits into two pathways (Figure 15). The right-hand path only employs Sequest in order to match the mass spectra to a number of databases (Section II3.2). The left-hand pathway is more involved. First, the spectra are submitted to de novo amino acid sequence prediction (Section II3.1). Then the de novo predictions are filtered and converted to GPF queries (Section II3.3). The GenomicPeptideFinder then matches the queries to a genomic database (Section II3.4). The output of GPF is a fasta-file which is correlated to the mass spectra that gave rise to the predictions, using a database search algorithm, here Sequest (Section II3.2). All these processing steps are automated in the proteomics pipeline AutoMS (Section II3.5). The results from AutoMS processing, which are reported in a plain text file, are afterwards imported into the database (Section II3.6).

3.1

PEAKS

The batch version of PEAKS was used in this study. A wrapper software providing a GUI was programmed in C++, using Microsoft Visual Studio .net. This software, called AutoPEAKS, is now part of AutoMS. In addition to the options available when using PEAKS, it is possible to set a charge cutoff in order to exclude ions of higher charged species, which usually cause longer processing times and do not yield significant amounts of useful information. All spectra from ions with a charge higher than two were thus excluded from processing with PEAKS. The parent ion tolerance was set to 1 Da and the fragment ion tolerance to 0.5 Da. For each dta-file 5 predictions were reported. We allowed cysteines to be carboxyamidomethylated. Since we used heavy arginine in the experiments involving the 1D SDS PAGE, we also allowed arginines to be 6 Da heavier. Both modifications were set to be variable.
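The charge cutoff described above can be sketched as follows. This is not the AutoPEAKS code, which was written in C++. The sketch assumes the common layout of Sequest dta-files, in which the first line holds the singly protonated precursor mass and the charge state and every following line holds one m/z and intensity pair; the function names are hypothetical.

```python
def read_dta(text: str):
    """Parse the content of one dta-file (assumed layout, see lead-in).

    Returns the precursor mass (M+H)+, the charge state, and the peak list.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    precursor_mh = float(rows[0][0])
    charge = int(rows[0][1])
    peaks = [(float(mz), float(intensity)) for mz, intensity in rows[1:]]
    return precursor_mh, charge, peaks


def passes_charge_cutoff(charge: int, max_charge: int = 2) -> bool:
    """Keep only spectra whose precursor charge does not exceed the cutoff.

    The default of 2 mirrors the setting reported in this section.
    """
    return charge <= max_charge
```

A wrapper would read each dta-file, call `passes_charge_cutoff` on the parsed charge, and hand only the accepted files to the de novo sequencing step.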

3.2

Sequest

Widely used settings were employed when searching databases with Sequest. Our significance thresholds for Xcorr values were 1.75 for singly charged, 2.5 for doubly charged, and 3.5 for triply charged parent ions. We used Sequest to match all acquired spectra against several databases in order to maximize the number of identifications. These fasta-files contained the following sequences, which all stem from Chlamydomonas reinhardtii:

- The chloroplast and mitochondrion proteomes
- The JGI gene models (release 2, http://genome.jgi-psf.org/chlre2/)
- All available EST sequences
- The six-frame translation of the genome

When matching GPF results against the corresponding dta-files, an enzyme was used that cuts after each letter “J”. Since “J” does not code for any amino acid and is therefore absent from the sequences, this procedure ensures that none of the predictions are actually cut, but are accepted as full-length sequences.
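The charge-dependent Xcorr thresholds given in this section can be written down as a small filter. The sketch below is illustrative only; the names are hypothetical, and it assumes that a value equal to the threshold counts as significant and that charge states without a stated threshold are rejected, neither of which is specified in the text.

```python
# Xcorr thresholds per parent ion charge as reported in this section.
XCORR_THRESHOLDS = {1: 1.75, 2: 2.5, 3: 3.5}


def is_significant(xcorr: float, charge: int) -> bool:
    """Return True if a Sequest hit reaches the threshold for its charge.

    Assumption: charge states without a listed threshold are not accepted.
    """
    threshold = XCORR_THRESHOLDS.get(charge)
    return threshold is not None and xcorr >= threshold
```

For example, an Xcorr of 2.6 would be accepted for a doubly charged parent ion but rejected for a triply charged one.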

3.3

Query Creation

The GenomicPeptideFinder does not directly accept PEAKS results as input. A level of indirection was introduced so that new de novo sequencing algorithms can easily be adapted to work with GPF. Furthermore, some filtering can be performed at this step. We accepted as input all those predictions from PEAKS that contained at least eight amino acids in their sequence and whose score was higher than 10%. These predictions were extracted from the fas-files created by PEAKS. In that file format modifications are not reported. The masses of the peptides are, however, adjusted to the assumed modification. If modifications were assumed by PEAKS, the mass of the GPF query was adjusted accordingly, i.e. the mass calculated from the peptide sequence was used for searching.
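The acceptance criteria for de novo predictions can be sketched as a simple filter. The record type `Prediction` and the function `select_queries` are hypothetical and only illustrate the two reported cutoffs (at least eight amino acids, score higher than 10%); parsing of the PEAKS output is not shown.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    sequence: str         # de novo predicted amino acid sequence
    score_percent: float  # confidence score reported by the de novo tool
    mass: float           # peptide mass used for the query


def select_queries(predictions, min_length: int = 8, min_score: float = 10.0):
    """Keep predictions that are long enough and score above the cutoff."""
    return [
        p for p in predictions
        if len(p.sequence) >= min_length and p.score_percent > min_score
    ]
```

Because the filter only depends on this small record, predictions from another de novo sequencing program could be passed through the same step once they are converted to the same form.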

3.4

The GenomicPeptideFinder

All searches, whether performed on our Windows PC or on the LINIAC cluster, University of Pennsylvania, PA, USA, used the same GPF core compiled in the appropriate environment. There is thus no difference in the algorithm employed when running in different environments. When matching the queries against the genome, we required that in the first search at least 5 amino acids of the prediction match the sequences found in the database exactly. For the second search, within a window of ±700 amino acids around hits from the first search, at least 3 consecutive amino acids had to match. The difference between calculated and measured precursor mass was restricted to 1000 ppm. Tryptic cleavage sites (R or K) had to be present on both sides of the peptide sequence. These settings only allow us to find peptides with a minimal length of 8 amino acids. Shorter peptides are always missed. Peptides that do not include stretches of 5 or more correct amino acids are missed as well.
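The search parameters above can be made concrete with a minimal sketch. It is not the GPF implementation, which is not reproduced here; it only illustrates an exact seed match, the window around a hit, and a ppm mass tolerance. All names are hypothetical, and the ppm deviation is assumed to be taken relative to the measured mass.

```python
def seed_hits(query: str, translated_frame: str, seed_length: int = 5):
    """Find exact matches of every seed_length-mer of the query.

    Returns (offset in query, position in frame) pairs.
    """
    hits = []
    for q in range(len(query) - seed_length + 1):
        seed = query[q:q + seed_length]
        pos = translated_frame.find(seed)
        while pos != -1:
            hits.append((q, pos))
            pos = translated_frame.find(seed, pos + 1)
    return hits


def second_search_window(hit_position: int, frame_length: int, flank: int = 700):
    """Window of +-flank amino acids around a first-search hit, clipped."""
    return max(0, hit_position - flank), min(frame_length, hit_position + flank)


def within_ppm(calculated: float, measured: float, tolerance_ppm: float = 1000.0) -> bool:
    """Precursor mass check; deviation relative to the measured mass (assumption)."""
    return abs(calculated - measured) / measured * 1e6 <= tolerance_ppm
```

For a query of eight amino acids, `seed_hits` tests four overlapping 5-mers, which is why predictions without any stretch of five correct amino acids cannot be placed on the genome.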

3.5

AutoMS

Proteomics pipelines are collections of tools that perform one or several analyses. They are usually fully automated and remove the burden of controlling many individual tools, which is not only tedious but also error prone. Usually proteomics pipelines are set up to achieve a specific goal. This is also true for AutoMS, which is the pipeline in this study. It was designed to automate all desired computational steps from spectra extraction up to data filtering prior to database import (Figure 15). AutoMS was designed and programmed in Microsoft Visual Studio .net using the C++ programming language. Where possible, ANSI C++ was used. It runs under most Microsoft Windows operating systems. AutoMS automates the programs used in this study. It enables batch processing of hundreds of tasks, which can easily be set up and, if required, individually customized. Some data filtering can be performed, and the significant results are reported in a plain text file which is easy to import into our database. All settings and options for the individual programs can be adjusted in AutoMS as well. The settings used are described in the individual sections of the tools employed (Sections II3.1-II3.4). File locations are adjusted to achieve a structured file repository. The code is robust enough to allow the software to run without interruption. GPF was run outside of the AutoMS environment on a computer cluster. Since all information was contained in a specific directory structure, it was however no problem to integrate the results after processing on the cluster. For the same reason, picking up further processing downstream of GPF did not pose any problem. The software is still in its beta phase, but the executable is available at http://www.automs.de.ms.

3.6

Database

The number of peptides detected in the experiments for the fourth publication made it necessary to create a relational database to hold and visualize the large amount of data. The database was developed in Microsoft Access and all functionality was programmed in Visual Basic for Access (VBA). The database including all functionality is available at http://www.pepprotdb.de.ms. It was designed to enable analysis of data stemming from more than one computational tool. All datasets in the database are retained within their biological context, for example their location on the gel. The table space and the relations are depicted in Figure 16. The experiment is defined by the three tables Gels, Spots, and Experiments, with the latter referring to multiple measurements of entries in the Spots table. The Spots table carries this name for historical reasons, because it was ported from an old version of the database, but it may also contain PAGE bands. The three tables PeptideScores, Peptides, and PeptideReferences represent the data acquired by mass spectrometry and computational processing of the mass spectra. The table Proteins, which is defined by the protein sequences given in the table Sequences, contains information on gene models or confirmed proteins.

[Figure 16 diagram. Tables shown: Gels, Spots, Experiments, PeptideScores, Peptides, PeptideReferences, Proteins, ProteinReferences, Sequences, References, KOG, PPDB, Predotar, TargetP, ChloroP, HMMTop, Homologies, Localisations, Loops. Legend: entity; one-to-many relation; one-to-one relation.]

Figure 16: Overview of the table space in the database. Result and temporary tables are not shown. Only table captions and their relations are shown.

Most of the information, which is partly represented in associated tables, can be automatically retrieved from the web, given a sequence. Information concerned with the localization of the proteins is stored in the tables PPDB (van Wijk, 2004), Predotar (Small et al., 2004), ChloroP (Emanuelsson et al., 1999), TargetP (Emanuelsson et al., 2000), HMMTop (Tusnady and Simon, 2001), and Localisations. Predicted functions and homologies are represented in the tables KOG (http://genome.jgi-psf.org/cgi-bin/kogBrowser?db=Chlre3) and Homologies. The database was primarily designed to enable integration of the results from several distinct tools into one database, while still retaining the possibility to distinguish the origin of each dataset. Two problems became apparent when using Microsoft Access as the database. One is that collaboration among several researchers is very tedious with local databases. The other is that graphical visualization was not possible in that environment.

4

Assessment of De Novo Sequencing Algorithms

Three de novo sequencing algorithms were compared in this study. The spectra measured for two bands from the 1D SDS PAGE experiments performed by Bianca Naumann were submitted to de novo amino acid sequencing by all three programs. The overall execution speed of the algorithms was also tested. PEAKS (Bioinformatics Solutions Inc., Waterloo, ON, Canada), using a dynamic programming approach with subsequent rescoring, was the fastest algorithm, processing one band in about 2 hours. DeNovoX (Thermo Finnigan, San Jose, CA, USA), an expert system approach to de novo sequencing, used about twice the amount of time for the same dataset. Lutefisk, using a graph theory approach with rescoring (Taylor and Johnson, 1997), needed the longest time to process the spectra (about 12 h). The same dataset was also processed with Sequest (Thermo Finnigan, San Jose, CA, USA). The spectra from this dataset were also correlated to the six-frame translation of the Chlamydomonas reinhardtii nucleotide database release 2 available at JGI (http://genome.jgi-psf.org/chlre2/chlre2.home.html). The de novo predictions were also searched against the same database with GPF, using the search parameters described in Section II3.4. All significant peptide identifications were visualized in a Venn diagram and complementary results were computed (Figure 17).

[Figure 17 Venn diagram. Set labels: Sequest, PEAKS, Lutefisk, DeNovoX. Region counts as extracted, with intron-split peptides in brackets: 28(13), 115, 11, 27 (16), 15, 3, 5, 3(0), 8 (2), 1(0), 9 (4), 3 (0), 27 (16)]

Figure 17: Four-way Venn diagram showing the de novo sequencing results from DeNovoX, Lutefisk, and PEAKS, as well as the database search results for Sequest. The absolute number of significant identifications is given and accompanied by the number of intron-split peptides in brackets, if applicable. Intersections show the number of complementary results. The computational processing for one of the two bands is represented here. Lutefisk is represented twice in an open circle, indicating that it actually contains only one result, but is duplicated for graphical representation.

This assessment shows that Lutefisk and PEAKS performed about equally well on this dataset, but significantly worse than Sequest. All three detected more peptides than DeNovoX. These results and the fact that PEAKS outperformed Lutefisk and DeNovoX with respect to processing speed represent the rationale for choosing PEAKS as the de novo sequencing algorithm in this study. Another interesting finding is that the data from the different de novo sequencing approaches is complementary. The increase in processing time when using all three programs prevented the exploitation of this complementarity in this study.

5

GenomicPeptideFinder Implementations

There are currently three implementations of the GenomicPeptideFinder (GPF).

1. The first one, used in the first two publications, is a version with a graphical user interface (GUI) that was written in the JAVA programming language. This version of GPF runs on all operating systems for which the JAVA virtual machine has been implemented, that is, essentially all commonly used systems. It is available for download at http://www.java-gpf.net.ms. This implementation is extremely slow, but its speed increases with each new implementation of the JAVA virtual machine.

2. The second implementation of GPF also contains a GUI, but only runs on Microsoft Windows operating systems. It was programmed in C++ in order to overcome the speed limitations of the JAVA version. More information is available at http://www.win-gpf.net.ms.

3. The third implementation of GPF is a command-line application that was successfully compiled and used on one UNIX and at least two distinct Microsoft Windows operating systems. It does not provide a GUI, which is why wrappers are needed for successful usage. This implementation was used on a large computer cluster at the University of Pennsylvania for the third publication. Because the problem is “embarrassingly” parallel, it is easy to deploy in distributed environments. The executable is not available for download, but a description is given at http://www.parallel-gpf.net.ms.

6

Confidence Calculation

The six-frame translation of the Chlamydomonas reinhardtii genome was digested with trypsin in silico. The resulting tryptic peptides were sorted into 1 Dalton sized bins. The count of peptides per bin was stored. We then plotted the bins against the number of peptides in the bins (Figure 18). SigmaPlot, Systat Software, Point Richmond, CA, USA, was used to fit the mass range (300Da-5000Da) that contains the mass range which is usually best resolved in the mass spectrometers used in this study. Equation 1 represents the best fitting equation for the data, as given by SigmaPlot.
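The in silico digest and the binning can be sketched as follows. This is not the PepDist code. The sketch assumes cleavage after every K or R without a proline rule or missed cleavages, monoisotopic residue masses, that stop codons in the translation are written as `*` and end a peptide, and that peptides containing other symbols are skipped; all names are hypothetical.

```python
from collections import Counter

# Monoisotopic residue masses in Da (an assumption; the text does not state
# whether monoisotopic or average masses were used).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence: str):
    """Cleave after every K or R (no proline rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide: str) -> float:
    return sum(MONOISOTOPIC[aa] for aa in peptide) + WATER


def mass_histogram(translations, bin_width: float = 1.0) -> Counter:
    """Count tryptic peptides per mass bin over all translated frames."""
    bins = Counter()
    for frame in translations:
        for segment in frame.split("*"):  # stop codons end a peptide
            for peptide in tryptic_digest(segment):
                if all(aa in MONOISOTOPIC for aa in peptide):
                    bins[int(peptide_mass(peptide) // bin_width)] += 1
    return bins
```

The resulting counter maps each 1 Da bin to the number of tryptic peptides falling into it, which is the kind of data a curve such as Equation 1 is fitted to.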

$$M_{Of} = \frac{3.31343 \cdot 10^{6}}{x} + \frac{5.77642 \cdot 10^{9}}{x^{2}} - 902.428$$

Equation 1

$x$: Peptide mass in Dalton

$M_{Of}$: Frequency of the occurrence of a given peptide mass

Given Equation 1 and the complete number of tryptic peptides resulting from the in silico digest, the chance for a random occurrence of a peptide with mass x can be calculated.

$$P_{E} = \frac{M_{Of}}{24177277}$$

Equation 2

$P_{E}$: Chance for random occurrence of a peptide with mass $x$

To assess the apparent database size for a peptide with mass x the mean number of amino acids that best fit the peptide mass is calculated.

$$N_{AA} = \frac{x}{M_{AA}}$$

Equation 3

$N_{AA}$: Mean number of amino acids per peptide with mass $x$

$M_{AA}$: Mean amino acid mass (assumed to be 100 Da)

The apparent database size is thus given by the number of peptides within the mass window and the mean number of amino acids per peptide.

$$DB_{App} = M_{Of} \cdot N_{AA}$$

Equation 4

$DB_{App}$: Apparent database size

Given the apparent size of the database and the number of matching amino acids in the peptide (here 5, or 5 + 3 for intron-split peptides) we calculated the probability for a random occurrence.

$$P_{A} = \frac{DB_{App}}{20^{5}}$$

Equation 5

$P_{A}$: Random occurrence of a string of amino acids in the apparent database

To calculate the random occurrence of any peptide only dependent on the mass of the peptide we combined the two probabilities.

$$P = P_{A} \cdot P_{E}$$

Equation 6

Equation 6 can be substituted with the appropriate variables and then be simplified to yield Equation 7.

$$P = \frac{1078}{x^{3}} + \frac{1.24}{x^{2}} + \frac{3.54 \cdot 10^{-4}}{x} + 2.63 \cdot 10^{-11} x + 1.93 \cdot 10^{-7}$$

Equation 7

This assessment of probability is just an approximation, but it appears accurate enough to conclude that any peptide found with the combination of PEAKS, GPF, and Sequest can be considered significant. Measured peptide masses are typically between 1000 and 3000 Da, corresponding to probabilities in the range of 3.2 ⋅ 10^-6 to 6.8 ⋅ 10^-7.
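The chain of Equations 1 to 6 can also be evaluated step by step, as in the following sketch. It uses the fitted coefficients and constants exactly as given in the equations; the function names are hypothetical. Since it evaluates Equations 1 to 6 directly and not the simplified Equation 7, its output need not coincide exactly with the range quoted above.

```python
TOTAL_TRYPTIC_PEPTIDES = 24177277  # denominator of Equation 2
MEAN_AA_MASS = 100.0               # M_AA in Equation 3, in Da
MATCHING_AMINO_ACIDS = 5           # exponent in Equation 5


def mass_frequency(x: float) -> float:
    """Equation 1: fitted number of tryptic peptides with mass x."""
    return 3.31343e6 / x + 5.77642e9 / x**2 - 902.428


def p_mass(x: float) -> float:
    """Equation 2: chance for random occurrence of a peptide with mass x."""
    return mass_frequency(x) / TOTAL_TRYPTIC_PEPTIDES


def apparent_db_size(x: float) -> float:
    """Equations 3 and 4: peptides in the mass window times mean length."""
    return mass_frequency(x) * (x / MEAN_AA_MASS)


def p_sequence(x: float) -> float:
    """Equation 5: random occurrence of the amino acid string."""
    return apparent_db_size(x) / 20**MATCHING_AMINO_ACIDS


def p_random(x: float) -> float:
    """Equation 6: combined probability, dependent only on the mass."""
    return p_sequence(x) * p_mass(x)
```

Calling `p_random` for masses between 1000 and 3000 Da shows the behaviour described in the text: the probability of a random occurrence decreases as the peptide mass increases.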

[Figure 18 plot. y-axis: Number of tryptic peptides (0 to 50000); x-axis: Bins [Dalton] (500 to 5000)]

Figure 18: The six-frame translation of the genome of Chlamydomonas reinhardtii was computationally digested by trypsin. All resulting fragments were sorted into one Dalton sized bins according to their calculated molecular mass. For each bin, the count of tryptic fragments is reported (black dots). The best fit was generated using SigmaPlot 9.0 and is displayed as a red line. It was calculated for the range in-between 800 and 5000 Dalton.

The peptide mass distribution in Figure 18 was calculated for Chlamydomonas reinhardtii. The same calculation for Toxoplasma falciparum gives a very similar distribution pattern, slightly offset to the right (data not shown). This estimate was done in order to assess the significance of peptides with little sequence information but additional mass information. It can be seen that in the region of interest (starting at 1000 Da) the number of tryptic peptides sharing the same mass decreases rapidly. This in turn means that the significance of a peptide assignment increases with its mass. The peptides were sorted into 1 Da sized bins, since that reflects the approximate error of our mass spectrometer. It would be possible to use smaller bin sizes for more precise instruments. That would shift the curve down and render peptides more significant overall.

The tool PepDist (http://www.pepdist.de.ms), written in Visual Studio .net, using C++, was used to digest the fasta file containing the genome of C. reinhardtii.

7

Protein Identification

Proteins, when identified via mass spectrometry and computational processing of the mass spectra, are usually considered confidently recognized when at least two peptides can be mapped to the protein. In line with this, all proteins that were matched by more than one peptide were considered confidently identified in the present study. This also holds when the supporting peptides stemmed from different algorithms. Finally, all proteins that were supported by at least one peptide that stemmed from the concerted processing of multiple algorithms were considered confidently identified.
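These acceptance rules can be summarized in a short sketch. The record type `PeptideHit` and its `concerted` flag are hypothetical; the flag is assumed to mark a peptide that resulted from the concerted processing of multiple algorithms, and distinct peptides are assumed to be distinguished by their sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    sequence: str    # identified peptide sequence
    algorithm: str   # tool or tool chain that produced the hit
    concerted: bool  # True if several algorithms were processed in concert


def confidently_identified(hits) -> bool:
    """Apply the identification rules of this section to one protein."""
    distinct_peptides = {hit.sequence for hit in hits}
    if len(distinct_peptides) >= 2:
        return True  # two or more peptides, from the same or different tools
    return any(hit.concerted for hit in hits)  # single concerted peptide
```

A protein supported by a single peptide from one algorithm alone is therefore not accepted, whereas a single peptide from the concerted processing is.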


III Publications

The four publications that are bundled in the present study are first assessed with regard to their relation to each other in Section III1. Following this, all publications are briefly overviewed in Sections 0-III1.4, with each publication directly following its overview.

1 Overview of the Included Publications

Four publications are included in this study. The two patent applications (Sections 0 and III1.2), which form a patent family, first introduced the suggested method to detect introns from mass spectrometric data with the help of de novo amino acid sequencing and database search. The later extension of the initial patent application introduces a method to validate peptides that are predicted to be split by an intron when deduced from genomic DNA. The publication in FEBS Letters (Section III1.3) extends the findings from the patent applications with additional experimental data. That study was performed using the de novo sequencing algorithm DeNovoX (Thermo Finnigan, San Jose, USA) and a Java version (see Section II5) of the algorithm proposed in the first patent application. In preparation for the fourth publication, the usability of three different de novo sequencing algorithms was first assessed (see Section II4) and a new C++ version (see Section II5) of the GenomicPeptideFinder (GPF) algorithm was developed. The fourth publication (Section III1.4) used the improved version of GPF and the PEAKS algorithm for de novo amino acid sequencing in a high-throughput environment. A high-throughput automated platform, AutoMS, was developed and used for the computational processing of the data. This proteomics pipeline is introduced in Appendix B. The GPF algorithm thus went through three stages: proposition (patents), single-peptide evaluation (FEBS Letters) and high-throughput application (Publication 4). It was then bundled with other necessary computational processing in a proteomics pipeline (Appendix B).


In the following, the publications are briefly introduced and then reprinted (Sections 0-III1.4). In case a publication has not been accepted yet, the manuscript is presented exactly as it was submitted for review. All manuals for software applications produced in this study, the source code for the Java version of GPF and for AutoMS, as well as additional information such as the complete dataset that led to the fourth publication (as a Microsoft Access database), are included on the accompanying CD. A more detailed description of the content of the CD can be found in Appendix C.

Publications - Patents

1.1 Publication 1

Hippler, M. and Allmer, J. (2003) Method to identify peptides from mass spectrometric data in genomic databases. Patent application number DE 103 41 595 A1, German Patent Office, published 31.03.2005, pending.

The patent application introduces the general algorithm for identification of peptides, split by an intron when deduced from genomic DNA, from mass spectrometric data. The algorithm was presented as a software application with graphical user interface programmed in JAVA. A small number of peptides, which were confirmed to have introns on the “genomic level” by EST data, were presented as proof of principle.

All programming and computational processing was performed by the author.

1.2 Publication 2

Hippler, M. and Allmer, J. (2003) Method to identify peptides from mass spectrometric data in genomic databases. Patent application number DE 10 2004 018 655 A1, German Patent Office, published 31.03.2005, pending.

This follow-up to the patent application (Publication 1) extends the general algorithm of intron detection with a method to discriminate between multiple putative predictions that arise from the same input. It also suggests a workflow that describes how to combine all necessary computational tools in order to achieve intron detection.

All programming and computational processing was performed by the author.

Publications - FEBS Letters

1.3 Publication 3

Allmer, J., Markert, C., Stauber, E.J. and Hippler, M. (2004) A new approach that allows identification of intron-split peptides from mass spectrometric data in genomic databases. FEBS Lett, 562, 202-206.

This publication provides the proof of principle that the approach to identify peptides that are split by an intron on the “genomic level” is feasible in the way it was proposed in the patent (Publication 1). It also introduces a software implementation written in the Java programming language that performs the data processing. The data for a number of peptides are presented in the publication. Here the main downside of the Java implementation becomes evident: it is computationally intensive, which led to very long run-times. Processing of one peptide took about 12 h on a regular PC.

All programming and computational processing was performed by the author.

FEBS Letters 562 (2004) 202-206

FEBS 28187

A new approach that allows identification of intron-split peptides from mass spectrometric data in genomic databases

Jens Allmer1, Christine Markert, Einar J. Stauber, Michael Hippler1,*

Lehrstuhl für Pflanzenphysiologie, Friedrich-Schiller-Universität Jena, Dornburger Str. 159, 07743 Jena, Germany

Received 14 November 2003; revised 7 January 2004; accepted 28 January 2004
First published online 3 March 2004
Edited by Gianni Cesareni

Abstract

We present a new approach that allows the identification of intron-split peptides from mass spectrometric data in genomic databases. Our algorithm uses small regions of peptide sequence information which are automatically deduced from de novo amino acid sequence predictions together with the molecular mass information of the precursor ion. The sequence predictions are based on selected collision-induced mass spectrometric fragmentation spectra. Fragments of the predicted amino acid sequence are aligned with each of the six frames of the translated genome and the precursor mass information is used to assemble the corresponding tryptic peptides using the sequence as a matrix. Hereby, intron-split peptides can be gathered and in turn verified by mass spectrometric data interpretation tools such as Sequest.

© 2004 Published by Elsevier B.V. on behalf of the Federation of European Biochemical Societies.

Key words: Intron-exon structure; Proteomics; Genome data; Mass spectrometry; Search algorithm; Chlamydomonas reinhardtii

1. Introduction

With the increasing number of sequenced genomes, the need for new computational approaches is evident. Software tools are needed to cover a broad range of applications. One of the biggest obstacles is the correct identification of intron-exon borders. To identify peptides and proteins from mass spectrometric data several strategies have been developed [1-6]. Up to now, none of these approaches has been able to detect peptides which are split by introns when deduced from genomic DNA sequences. Prediction of intron-exon boundaries for the identification of open reading frames using genomic data is performed by numerous software tools [7-15]. However, those predictions are often erroneous [15-18]. This has especially been outlined in a recent study [19]. An experimental verification of the Caenorhabditis elegans genome annotation demonstrated that 50% of the predicted genes (about 4000 genes) needed corrections in their intron-exon structures.

Recently, we analyzed light-harvesting proteins from Chlamydomonas reinhardtii in a detailed proteomic study [20]. There we realized that Sequest searches with mass spectrometric data identified several peptides in EST databases which could not be detected in the genomic database from C. reinhardtii. One possible explanation for this finding is that these peptides are split by introns when deduced from the genomic sequence. It has been estimated that at least 20-25% [21] of tryptic peptides deduced from genomic databases are split by introns. Our new approach enables identification of these peptides in conjunction with mass spectrometric data interpretation tools such as Sequest and thereby defines intron-exon borders. This approach is related to the sequence tag search algorithm [1,4,22] and uses fragments of amino acid sequences generated by de novo amino acid sequence predictions of tandem mass spectrometry (MS/MS) data together with the corresponding peptide mass of the respective precursor ion. We named this newly developed algorithm the GenomicPeptideFinder (GPF).

*Corresponding author. Fax: (49)-364-1949232. E-mail address: [email protected] (M. Hippler).
1 Present address: Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA.

Abbreviations: GPF, genomic peptide finder; MS, mass spectrometry; MS/MS, tandem mass spectrometry

2. Materials and methods

2.1. GPF data input
Queries for GPF were generated using de novo amino acid sequencing software (DeNovoX, Thermo Finnigan). Results produced by DeNovoX with a relative probability equal to or greater than 0.1 were queried by GPF. Queries can include monoisotopic or average masses and an identification string for the peptide:

Query e.g. [RZ]AAYPG[VV]CFNPYNLGK

Z represents a cysteine that is carbamidomethylated (plus 57 Da).

2.2. Computer equipment
GPF was originally programmed in Java™ and was tested on several platforms:

1. Pentium II, 400 MHz, 256 MB RAM, Windows 98
2. Pentium III, 966 MHz, 256 MB RAM, Windows XP (Laptop)
3. Pentium IV, 2400 MHz, 1024 MB RAM, Windows 2000
4. IBM RS/6000, F80 4-Way RS64 III 450 MHz Proc, UNIX

2.3. GPF functions and settings
Each amino acid sequence prediction is computationally fragmented (all possible sequence stretches of a given size are produced) and the resulting fragments are used to search for identities in the six frame translation of the genomic database. Two searches are performed: the first one with a longer stretch of amino acids to cut down on processing time and the second one with a shorter sequence which is only invoked if the first one results in matches in one of the

0014-5793/04/$30.00 © 2004 Published by Elsevier B.V. on behalf of the Federation of European Biochemical Societies. doi:10.1016/S0014-5793(04)00212-1


Fig. 1. Abstract and simplified flow chart of the work flow of the GPF. Possible candidates are stored in a result database for further validation.

genomic scaffolds. The more stringently the first comparison is set (a setting of five matching amino acids was used in this study), the faster the search. For the second search, the number of matching amino acids was lowered to three (in this study). In case an amino acid sequence matches the translated genomic sequence (≥ 5 amino acids), the adjacent tryptic cleavage sites are predicted and the respective mass of the deduced tryptic peptide is determined. Then, four different processes are activated that are aimed to identify peptides in a genomic database which match the search criteria. Selected peptides are stored in a database (result database). When the respective precursor mass matches the tryptic peptide’s mass, it is extracted and stored in the result database (Process 1, Fig. 1). When process 1 does not result in matching peptides, process 2 is activated which leads to excision of a sequence between two independent amino acid sequence hits in a single scaffold (Process 2, Fig. 1). The respective tryptic peptide is deduced from the genomic data and its mass is matched with that of the precursor ion. In case the masses match within a specified error range, the respective peptide is extracted and stored. When the mass of the tryptic peptide that harbors the two assembled sequence fragments does not agree with the mass of the precursor ion, the sequence is extended from the end of the sequence fragments that hit the translation of the genomic sequence along the corresponding reading frames using the deduced amino acid sequence as a matrix until the resulting tryptic peptide matches the mass of the precursor ion (Process 3, Fig. 1). Process 4 (Fig. 1) operates like process 2 but allows sequence errors in the genomic database. Processes 2-4 lead to the identification of putative tryptic peptides that are split by an intron when deduced from the genomic sequence.
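The two-stage fragment search described above can be illustrated with a short Python sketch. It is not the GPF code: the function names, the dictionary of translated frames, and the toy frame in the usage note are hypothetical, and the sketch works on frames that are already translated to amino acids. Only the fragment lengths (five, then three) are taken from the text.

```python
def kmers(sequence, k):
    """All contiguous sequence stretches of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def find_hits(query, frame, k):
    """Return sorted (position, fragment) pairs for every k-mer of the
    query that occurs in the translated frame."""
    hits = []
    for fragment in kmers(query, k):
        pos = frame.find(fragment)
        while pos != -1:
            hits.append((pos, fragment))
            pos = frame.find(fragment, pos + 1)
    return sorted(hits)


def two_stage_search(query, frames, long_k=5, short_k=3):
    """Run the cheap long-fragment search first; run the short-fragment
    search only in frames where the long search found an anchor."""
    results = {}
    for name, frame in frames.items():
        if not find_hits(query, frame, long_k):
            continue  # no 5-mer match here, so the 3-mer search is skipped
        results[name] = find_hits(query, frame, short_k)
    return results
```

For example, with the invented frame `"MKNFGSVNEVSLQWRADPIFKTT"` and the invented query `"CRGSVNEDPIFK"`, the long search anchors on `GSVNE` and `DPIFK`, and the short search then reports the 3-mer hits on both sides of the intervening stretch. Two separated hit clusters of this kind are the starting point for Process 2.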
For peptide sequences which were found by GPF and were stored in the result database an E value and a score were assigned (see below). An additional validation step represents the correlation between the mass spectra from which the de novo predictions and therefore the queries for GPF were originally derived and peptides found by GPF using an MS/MS search tool such as Sequest. These evaluation steps enable GPF to identify peptides which are intron-split within the genomic sequence. For mass calculations monoisotopic masses were used. The error window for mass deviations between a measured mass and the mass of a deduced tryptic peptide was set at 700 ppm since this is the approximate error window of the ion trap mass spectrometer. A Java™ version of the GPF software with graphical user interface can be obtained for basic research purposes upon request.
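The 700 ppm error window amounts to a relative-deviation check between the measured precursor mass and the mass of a deduced tryptic peptide. The sketch below illustrates this in Python; the function names are hypothetical, the numbers in the usage note are made up, and the choice of the theoretical mass as the reference for the relative error is an assumption.

```python
def ppm_error(measured_mass, theoretical_mass):
    """Signed deviation of the measured mass in parts per million,
    relative to the theoretical mass (assumed reference)."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(measured_mass, theoretical_mass, tolerance_ppm=700.0):
    """True if the two masses agree within the given ppm window."""
    return abs(ppm_error(measured_mass, theoretical_mass)) <= tolerance_ppm
```

At a peptide mass of 1000 Da the 700 ppm window corresponds to 0.7 Da, so a measured mass of 1000.5 Da would be accepted and 1001.0 Da would be rejected. The window grows proportionally with the mass.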

2.4. Mass spectrometry and sample preparation
LC-MS/MS analyses were performed with an LCQ Deca XP ion trap mass spectrometer (Thermo Finnigan), which was coupled to a nano-HPLC (Ultimate, LC-Packings), as described [20]. In case the charge state of an ion could not be determined, both doubly and triply charged ion states were taken into account for the de novo amino acid sequence predictions. Isolation of photosystem I particles and analyses by two-dimensional gel electrophoresis were performed as described in [20].

2.5. De novo predictions
DeNovoX™ (Thermo Finnigan) was used to interpret mass spectra which could not be identified by Sequest. The software provides two markers for prediction quality. One is the absolute probability which is only subject to the assumption that the chemical species being sequenced is a peptide. The other, the relative probability, further assumes that all necessary sequencing information is included in the spectrum. According to the DeNovoX™ manual an absolute probability of 20% or more combined with a relative probability of 75% or more is a strong indication that the sequence or subsequence is correct. We only examined full length sequence predictions with a relative probability equal to or greater than 10%.

2.6. Using GPF to connect de novo predictions to the original spectra
The GPF software defines a database of possible peptides from de novo predictions. This peptide list has to be checked against the original spectrum to actually identify the correct peptide. This is done by a correlation of the original spectrum with the in silico produced spectra based on this database by Sequest or Mascot. GPF, therefore, provides the link between de novo predictions and Sequest or Mascot evaluation of MS/MS data.

2.7. E value and score calculation
An expectation value (E value) and a score were calculated. These two values are used as parameters to cut the overall workload.
If peptides are clearly insignificant (E value) or if they do not resemble the de novo prediction (score) they are not further processed and thus not stored. Threshold values can be set to exclude results. In our case all results were stored which is also the default setting. The E value reflects the expectation for a peptide to occur randomly within the genome. It is therefore dependent on the size of the genome, the sequence and the mass. The sequence in turn depends on the local amino acid distribution as does the mass. A genome size of 100 Mb was used for E value calculations which corresponds to the size of the C. reinhardtii genome (assembly 1) that was downloaded from the Chlamydomonas Genetics Center (http://www.biology.duke.edu/chlamy/). A probability below 0.05 is usually considered significant but for readability reasons −10 log10(probability) was calculated so that E values above 106 can be considered significant [23]. To assess the similarity between queries and their corresponding results we introduced a scoring system. The score value would be 1 for a perfect match and decreases with lower identity. The peptides listed in Table 1 meet this significance criterion. Since the E value and score calculations are merely used to set limiting values we will not present the algorithms deployed here. The exact formulas and their descriptions can be obtained upon request.

Table 1
Peptides that were gathered by GPF were analyzed by Sequest using the result database and the respective MS/MS spectra

Queries:

MS/MS spectrum (see Fig. 2)   Query as given by de novo amino acid sequence prediction   Relative probability
A                             WLQYSEVLH[AR]                                              0.272
A                             [LW]QYSEVLH[AR]                                            0.410
A                             [PD]SQYSQVLH[AR]                                           0.316
B                             CRGSVPFN[PN]FK                                             0.172
B                             CRGSVNFP[PL]FK                                             0.363
B                             CRGSVN[DE][PL]FK                                           0.247
C                             [RZ]AAYPG[VV]CFNPYNLGK                                     0.253

Number of hits (a), one value per query (the assignment to individual queries is not recoverable from the source layout): 144, 248, 200, 127, 204, 323, 113

Hits (two per query):

MS/MS spectrum   Tryptic peptide as deduced from the genomic data   Scaffold   XCorr   E value   Score
A                WLQYSEVIHAR                                        1152       3.477   210.15    0.848
A                QYSATRTVLHAR                                       905        1.523   194.57    0.613
A                QDPSYSQVLLPR                                       77         1.743   208.62    0.778
A                LGTWFSSGIAHAR                                      77         1.434   225.21    0.409
A                WLQYSEVIHAR                                        1152       3.477   210.15    0.848
A                QYSATRTVLHAR                                       905        1.523   194.57    0.613
B                NFGSVNEDPIFK                                       3122       3.930   259.54    0.831
B                CRGSVNDEPLFK                                       3122       2.275   227.83    0.831
B                TCRGSVPPPLPDK                                      78         1.744   177.23    0.691
B                CRGSVLPAPPLTR                                      40         1.064   129.97    0.610
B                GSMDNNSGGEEGGR                                     158        1.189   282.12    0.345
B                SACPCRGCWAASR                                      476        1.020   186.94    0.317
C                GSGDAAYPGGPFFNLFNLGK                               1152       4.290   456.18    0.599
C                VVCLMSAIALPYNGGLRPGV                               582        1.396   439.02    0.593

The XCorr factor for each peptide was determined. Additionally E values and score values were calculated for each peptide. The two hits resulting in the best XCorr factors are listed for each query.
(a) All hits are included, also duplicates which occur in case the de novo prediction gives several alternatives.
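The logarithmic transformation used for the E value in Section 2.7, read here as −10·log10(probability), can be illustrated as follows. This Python sketch covers only the transformation and its inverse. How GPF derives the underlying probability from genome size, sequence and mass is not given in the text and is not modelled; the function names are hypothetical.

```python
import math


def e_value(probability):
    """Transform a probability into the -10*log10 scale."""
    return -10.0 * math.log10(probability)


def probability_from_e_value(e):
    """Inverse transformation back to a probability."""
    return 10.0 ** (-e / 10.0)
```

On this scale smaller probabilities give larger values, and every factor of ten in probability adds 10 units. A probability of 0.05 maps to roughly 13.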

3. Results and discussion

To identify peptides in genomic databases, GPF starts with an alignment of an amino acid sequence that originates from an interpretation of a MS/MS spectrum using a DeNovoX amino acid sequencing software with a respective genomic database (Fig. 2). Such de novo amino acid sequence predictions often result in more than one peptide sequence per MS/MS spectrum. For GPF analysis we restricted the search to those de novo peptide sequence predictions that were given a relative probability equal to or higher than 0.1 (Table 1) as defined by the DeNovoX software. As proof of principle, three examples are discussed in detail.

The functionality of process 1 is illustrated by GPF analyses of de novo sequence predictions evaluated from an MS/MS spectrum that also led to the identification of a peptide by searching the Chlamydomonas EST and genomic databases using Sequest [20]. The identified peptide WLQYSEVIHAR is derived from the lhca3 gene product and is not split by an intron in the nuclear lhca3 gene present in scaffold 1152 of the genomic database. Three de novo amino acid predictions of that MS/MS spectrum (Table 1; A) were queried by GPF. GPF found the peptide WLQYSEVIHAR two times in scaffold 1152 which is due to several similar de novo prediction alternatives. The E value together with score calculations by GPF for this peptide indicate that this peptide is the most significant tryptic peptide sequence among the sequences found by GPF when analyzing the respective sequence queries.

The next two examples derive from MS/MS spectra that did not result in significant Sequest scores using the genomic database but matched when searching the EST Chlamydomonas database. Evaluation of the MS/MS spectra shown in Fig. 2B,C by DeNovoX resulted in the prediction of several amino acid sequences that were used as queries for GPF. These queries resulted in numerous tryptic peptide sequences which were found to be significant and good candidates for further evaluation and were therefore collected by GPF. Among these sequences peptide NFGSVNEDPIFK as well as peptide GSGDAAYPGGPFFNLFNLGK resulted in E values in conjunction with scores derived for the queries deduced from MS/MS spectra 1B and 1C, respectively (Table 1), that were very promising. Analysis of the respective MS/MS spectra using the Sequest algorithm and a database that contains the peptides found by GPF can be used to prove that the given peptides are represented by the MS/MS spectra from which the de novo amino acid predictions were obtained. This analysis resulted in cross-correlation factors


Fig. 2. MS/MS spectra of doubly charged precursor ions and the resulting de novo amino acid sequence predictions. (Z = Cys+57 Da.)

(XCorr) well above 2.5 for the peptides NFGSVNEDPIFK and GSGDAAYPGGPFFNLFNLGK. An XCorr value larger than 2.5 is considered to be significant for fragmentation spectra of doubly charged precursor ions [6] (Table 1). The Sequest search therefore confirmed the GPF predictions. For peptide NFGSVNEDPIFK GPF predicts the following intron-exon boundary: NFGSVNE-intron-DPIFK. This peptide can be deduced from the lhca5 gene product [20] and is split by an intron in the nuclear gene exactly as predicted by GPF. In this case the amino acid sequence predicted by DeNovoX allowed GPF to determine the correct intron-exon boundary (Fig. 1, Process 2). For peptide GSGDAAYPGGPFFNLFNLGK


GPF matched sequence fragments AAYPG and NLGK. The peptide sequence was extended to the next tryptic cleavage site on the left border of AAYPG that defines the sequence GSGD using the deduced amino acid sequence as a matrix. However, the mass of the resulting peptide GSGDAAYPGNLGK did not match the mass of the precursor ion. Therefore, the peptide sequence was extended on the right border of AAYPG and/or left border of NLGK again using the deduced amino acid sequence as a matrix. Amino acids are added to prolong the peptide sequence until the mass of the peptide matches the mass of the precursor ion or exceeds it, which then terminates the process (Fig. 1, Process 3). Insertion of sequence GPFFNLF resulted in peptide GSGDAAYPGGPFFNLFNLGK which matches the mass of the precursor ion and thus defines an intron-exon boundary within the peptide as GSGDAAYPG-intron-GPFFNLFNLGK. Searching in the Chlamydomonas EST database with this peptide sequence revealed that it can be deduced from the lhca3 gene product [20]. The coding region for this peptide consists of two exons split by an intron exactly as predicted by GPF.

We conclude that our approach enables the identification of peptides which are split by introns in the genome. In addition, our approach has the ability to verify and annotate mistakes in genomic sequences using mass spectrometric data (Process 4, Fig. 1). We suggest that our software tool can be used to complement Sequest or Mascot search tools when mass spectrometric data are used to search in genomic databases to significantly increase the number of identified peptides and proteins.

Acknowledgements: This work has been supported by grants of the Federal State of Thüringen (Nachwuchsgruppe Pflanzenphysiologie) and the Deutsche Forschungsgemeinschaft to M.H.

References

[1] Mann, M. and Wilm, M. (1994) Anal. Chem. 66, 4390-4399.
[2] Giddings, M.C., Shah, A.A., Gesteland, R. and Moore, B. (2003) Proc. Natl. Acad. Sci. USA 100, 20-25.
[3] Mann, M., Hojrup, P. and Roepstorff, P. (1993) Biol. Mass Spectrom. 22, 338-345.
[4] Sunyaev, S., Liska, A.J., Golod, A. and Shevchenko, A. (2003) Anal. Chem. 75, 1307-1315.
[5] Perkins, D.N., Pappin, D.J., Creasy, D.M. and Cottrell, J.S. (1999) Electrophoresis 20, 3551-3567.
[6] Eng, J., McCormack, A.L. and Yates, J.R. (1994) J. Am. Soc. Mass Spectrom. 5, 976-989.
[7] Hutchinson, G.B. and Hayden, M.R. (1992) Nucleic Acids Res. 20, 3453-3462.
[8] Xu, Y., Mural, R., Shah, M. and Uberbacher, E. (1994) Genet. Eng. 16, 241-253.
[9] Burge, C. and Karlin, S. (1997) J. Mol. Biol. 268, 78-94.
[10] Claverie, J.M. (1997) Hum. Mol. Genet. 6, 1735-1744.
[11] Henderson, J., Salzberg, S. and Fasman, K.H. (1997) J. Comput. Biol. 4, 127-141.
[12] Reese, M.G., Eeckman, F.H., Kulp, D. and Haussler, D. (1997) J. Comput. Biol. 4, 311-323.
[13] Pertea, M., Lin, X. and Salzberg, S.L. (2001) Nucleic Acids Res. 29, 1185-1190.
[14] Larsen, T.S. and Krogh, A. (2003) BMC Bioinformatics 4, 21.
[15] Majoros, W.H., Pertea, M., Antonescu, C. and Salzberg, S.L. (2003) Nucleic Acids Res. 31, 3601-3604.
[16] Burset, M. and Guigo, R. (1996) Genomics 34, 353-367.
[17] Guigo, R. (1997) Comput. Chem. 21, 215-222.
[18] Guigo, R., Agarwal, P., Abril, J.F., Burset, M. and Fickett, J.W. (2000) Genome Res. 10, 1631-1642.
[19] Reboul, J. et al. (2003) Nat. Genet. 34, 35-41.
[20] Stauber, E.J., Fink, A., Markert, C., Kruse, O., Johanningmeier, U. and Hippler, M. (2003) Eukaryot. Cell 2, 978-994.
[21] Choudhary, J.S., Blackstock, W.P., Creasy, D.M. and Cottrell, J.S. (2001) Proteomics 1, 651-667.
[22] Shevchenko, A. et al. (1996) Proc. Natl. Acad. Sci. USA 93, 14440-14445.
[23] Perkins, D.N., Pappin, D.J., Creasy, D.M. and Cottrell, J.S. (1999) Electrophoresis 20, 3551-3567.


Publications – Publication 4

1.4 Publication 4

Allmer, J., Naumann, B., Markert, C., Zhang, M. and Hippler, M. (2006) Mass spectrometric genomic data mining: Novel insights into bioenergetic pathways in Chlamydomonas reinhardtii. under review

This publication introduces a new version of the GenomicPeptideFinder which is significantly faster than the version introduced in Publications 1 and 2. This version of GPF was used locally and on a 256-processor cluster at the University of Pennsylvania. Processing of one peptide now took only about three minutes on a regular PC, which is about a 250-fold improvement in speed. Analysis was performed on mass spectrometric data mostly derived from Chlamydomonas reinhardtii thylakoid fractions that were separated by 1D SDS-PAGE. It was shown that GPF not only enables identification of peptides split by an intron when deduced from genomic DNA, but may also add confidence to protein identification by complementing database search results. The online supplement to this publication can be found in Appendix A.

Programming, database design, data analysis, data visualization, and computational processing were performed by the author.


Mass spectrometric genomic data mining: Novel insights into bioenergetic pathways in Chlamydomonas reinhardtii

Jens Allmer1, Bianca Naumann1, Christine Markert, Monica Zhang and Michael Hippler1*

Plant Science Institute, Department of Biology, University of Pennsylvania, Philadelphia, PA, 19104, USA

Running title: Mass spectrometric genomic data mining

Keyword: genomic data mining, proteomics, mass spectrometry, error-tolerant search, Chlamydomonas reinhardtii

1 Present address: Department of Biology, Institute of Plant Biochemistry and Biotechnology, University of Muenster, Hindenburgplatz 55, 48143 Muenster, Germany

*corresponding author, email: [email protected], Phone: 49 251 8324790, Fax: 49 251 8328371


Abstract

A new high-throughput computational strategy was established that improves genomic data mining from mass spectrometric experiments. Tandem mass spectrometric (MS/MS) data were analyzed by the Sequest search algorithm and a combination of de novo amino acid sequencing in conjunction with an error-tolerant database search tool, operating on a 256-processor computer cluster. The error-tolerant search tool, previously established as GenomicPeptideFinder (GPF), enables detection of intron-split and/or alternatively spliced peptides from MS/MS data when deduced from genomic DNA. Isolated thylakoid membranes from the eukaryotic green alga Chlamydomonas reinhardtii were separated by one-dimensional SDS gel electrophoresis, protein bands were excised from the gel, digested in-gel with trypsin and analyzed by coupling nano-flow liquid chromatography with MS/MS. The concerted action of Sequest and GPF allowed identification of 11735 unique peptides. In total 1094 peptides were identified by GPF analysis alone including 698 intron-split peptides, resulting in the identification of novel proteins, improved annotation of gene models and discovery of alternative splicing.


1. Introduction

Proteomic research is driven by the development of new mass spectrometric technology and bioinformatic tools to handle and evaluate the mass spectrometric data. In recent years, new mass spectrometers were developed permitting ever more sensitive, fast and precise peptide and protein analysis. In line with this, the amount of data stemming from proteomic experiments is becoming enormous. Today’s bottleneck seems to be the evaluation of this data. A well established way to identify peptides and proteins from mass spectrometric data is to search a protein database and match the mass spectra to the database entries. This is realized in several search engines like Sequest [1], Mascot [2, 3], Sonar [4], GutenTag [5] and other novel approaches [6-8]. One obvious limitation of all these algorithms is that the sequence searched for must be present in the database. Another problem is the high number of false positive identifications. Several tools have been developed to make the findings more reliable [4, 9-13]. A different way of addressing the mass spectra is to directly deduce the amino acid sequence from the information contained therein. A number of these programs have been described and are in use today [14-19]. These programs face other limitations. They are usually computationally intensive and dependent on high quality spectra [20, 21]. For these reasons, they are quite limited in practice. New tools might increase both speed and reliability of de novo predictions [21-23]. Today, de novo amino acid sequencing is far from being perfect, but using the predictions for an error-tolerant search in the genomic database might reveal new information. For advancing annotation, it is desirable to map back the proteomic information to the corresponding


genome [24], which can be achieved by this combination of tools. We will present results that take advantage of such an approach, using PEAKS [14] to perform de novo amino acid sequencing from the mass spectra and the GenomicPeptideFinder (GPF) [25] to search the translation of the genomic database of the unicellular eukaryotic green alga Chlamydomonas reinhardtii in an error-tolerant fashion. The aim was to employ mass spectrometric data for genomic data mining. The error-tolerant search tool, GPF, was shown to enable detection of intron-split and/or alternatively spliced peptides from MS/MS data when deduced from genomic DNA [25]. From the EST and genomic data obtained for model organisms like human, mouse or Arabidopsis it is obvious that roughly half of the gene products that are encoded by a genome cannot be annotated from EST data [26-29]. Additionally, it is suggested that alternative splicing adds another enormous factor to proteome diversity in eukaryotes with complex genomes. It is assumed that 25% of the peptides found via mass spectrometry from eukaryotes contain introns when mapped back to the genome [30]. It is further supposed that about 40% of the human genes are alternatively spliced [31]. Thus genomic data mining employing mass spectrometric information will be important and indispensable for the identification of protein diversity in these organisms. For the analysis of the mass spectrometric data we established a platform where MS/MS spectra were analyzed by the Sequest search algorithm and a combination of de novo amino acid sequencing in conjunction with our error-tolerant database search tool, operating on a 256-processor computer cluster. This high-throughput platform was applied to identify proteins, facilitate annotation of gene models as well as to recognize proteins that originate from alternative splicing. The

analysis was done with C. reinhardtii, which is an important eukaryotic model system for the investigation of fundamental molecular processes, for instance in bioenergetics [32] and motility [33], and is an emerging model system for proteomic research [34]. Chlamydomonas has complex gene structures with several exons per gene. It is therefore an appropriate system to address the feasibility of genomic data mining using mass spectrometric data. We processed mass spectrometric data originating from thylakoid membranes isolated from arginine auxotrophic cell wall less Chlamydomonas cells. Thylakoid membranes represent, besides the inner and outer envelope membranes, the third membrane system of the chloroplast and harbor the photosynthetic machinery. It is commonly accepted that chloroplasts in green algae and plants evolved from a cyanobacterial endosymbiont. The majority of the chloroplast proteome is encoded by the nucleus, whereas the chloroplast genome itself encodes slightly more than 100 proteins and RNA molecules. The nuclear-encoded chloroplast proteins are synthesized in the cytosol as precursors carrying an N-terminal transit peptide sequence that targets the protein to the chloroplast. The N-terminal transit sequence is proteolytically removed after import of the protein into the chloroplast. An algorithm that aimed to predict the presence of a chloroplast transit peptide sequence in all proteins encoded by the Arabidopsis nuclear genome suggested that 4225 proteins are localized in the chloroplast [35]. Among these proteins, about 520 are predicted to have at least one transmembrane domain and are therefore localized either to the outer/inner envelope or to the thylakoid membrane system. Proteomic studies performed in vascular plants indicated the presence of more than 700 non-redundant proteins in the thylakoid and outer/inner envelope membranes [36-40].
So far, no large-scale thylakoid membrane proteomic study has been performed with green
algae. Therefore, we aimed to use our newly developed computational genomic data mining strategy to elucidate the thylakoid proteome of the green alga C. reinhardtii. Our study identified numerous new plant proteins as well as Chlamydomonas-specific proteins. The data also demonstrated that the high-throughput platform can be used to detect intron-split and/or alternatively spliced peptides from MS/MS data. The findings also indicated that the concerted use of different algorithms improves protein identification.

2. Materials and Methods

2.1 Mass spectrometry

Instrument set-up for liquid chromatography (Ultimate system, LC-Packings) and mass spectrometry (LCQ-Deca XP plus, ThermoFinnigan), protein in-gel tryptic digest, sample handling and SDS-PAGE analysis were performed as described [41]. Protein samples originated from isolated isotopically labeled iron-sufficient or unlabeled iron-deficient thylakoid membranes. Isotopic labeling of proteins was achieved by growth of arginine auxotrophic Chlamydomonas cell wall less cells (CW15) in the presence of either 13C-labeled or 12C-unlabeled arginine as described [41]. Protein identification data were further used for quantitative peptide profiling (Naumann, Allmer and Hippler, manuscript in preparation).

2.2 Sequest

Widely used settings were applied when searching databases with Sequest. Our significance thresholds for Xcorr values were 1.75 for singly charged, 2.5 for doubly charged and 3.5 for triply charged parent ions. We used Sequest to match all acquired spectra against several databases to maximize the number of identifications. These fasta-files contained the following sequences, all of which stem from Chlamydomonas reinhardtii:

• The chloroplast and mitochondrion proteomes
• The JGI gene models
• All available EST sequences
• The six-frame translation of the genome

When matching GPF results against the corresponding dta-files, an enzyme definition was used that cuts after each letter J. Since J does not code for any amino acid and is therefore absent from the sequences, this procedure ensures that none of the predictions is actually cut and that each is accepted as a full-length sequence.
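
The charge-dependent Xcorr thresholds above can be illustrated with a minimal Python sketch. The function name, the record layout and the example hits are hypothetical, and the inclusive comparison as well as the rejection of charge states without a listed threshold are assumptions rather than documented Sequest or AutoMS behavior.

```python
# Charge-dependent Xcorr cut-offs quoted in the text.
XCORR_THRESHOLDS = {1: 1.75, 2: 2.5, 3: 3.5}


def is_significant(xcorr: float, charge: int) -> bool:
    """Return True if a Sequest hit meets the cut-off for its charge state.

    Assumptions: the comparison is inclusive (>=), and charge states
    without a listed cut-off are rejected.
    """
    threshold = XCORR_THRESHOLDS.get(charge)
    return threshold is not None and xcorr >= threshold


# Hypothetical hits, for illustration only.
hits = [
    {"peptide": "PEPTIDEK", "charge": 2, "xcorr": 3.10},
    {"peptide": "SAMPLER", "charge": 2, "xcorr": 2.10},
    {"peptide": "ANOTHERK", "charge": 3, "xcorr": 3.60},
]
significant = [h["peptide"] for h in hits if is_significant(h["xcorr"], h["charge"])]
print(significant)
```

Keeping the thresholds in a single lookup table makes it easy to document and change them per experiment.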

2.3 PEAKS

Since we found that only little information could be gained from triply charged peptides, we excluded all charges higher than 2 from PEAKS processing. The parent ion tolerance was set to 1 Dalton and the fragment ion tolerance to 0.5 Dalton. For each dta-file, 5 predictions were reported. We allowed cysteines to be carboxyamidomethylated. Since we used heavy arginine in the experiments involving one-dimensional SDS-PAGE, we allowed arginines to be 6 Dalton heavier.

2.4 Query Creation

The GenomicPeptideFinder does not directly accept PEAKS results as input. A level of indirection was introduced so that new de novo sequencing algorithms can easily be adapted to work with GPF. Furthermore, some filtering can be performed at this step. We accepted as input all PEAKS predictions that contained at least eight amino acids and whose score was higher than 10%. These predictions were extracted from the fas-files created by PEAKS. In that file format, modifications are not reported; the mass of the peptide is, however, adjusted to the assumed modification. If modifications were assumed by PEAKS, the mass of the GPF query was adjusted accordingly, i.e. the mass calculated from the peptide sequence was used for searching.
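
The filtering step can be sketched as follows. This is only an illustration of the two stated criteria (at least eight amino acids, score higher than 10%); the `Prediction` class, the function name and the example values are hypothetical and do not reflect the actual PEAKS file format or the GPF query format.

```python
from dataclasses import dataclass

MIN_LENGTH = 8      # at least eight amino acids
MIN_SCORE = 10.0    # PEAKS score in percent; must be exceeded


@dataclass
class Prediction:
    sequence: str
    score_percent: float


def select_queries(predictions):
    """Keep predictions that are long enough and score above the cut-off."""
    return [
        p for p in predictions
        if len(p.sequence) >= MIN_LENGTH and p.score_percent > MIN_SCORE
    ]


# Hypothetical predictions, for illustration only.
candidates = [
    Prediction("NYTSVLVTAPEGK", 62.0),      # kept
    Prediction("NYTSVLK", 80.0),            # dropped: only seven residues
    Prediction("SINANYTSVLVTAPEGK", 10.0),  # dropped: score not above 10%
]
queries = select_queries(candidates)
print([q.sequence for q in queries])
```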

2.5 GPF

All searches, whether performed on our Windows PC or on the LINIAC cluster, used the same GPF core compiled in the appropriate environment. There is thus no
difference in the algorithm employed when running in different environments. When matching the queries against the genome, we required that in the first search at least 5 consecutive amino acids of the prediction exactly match a sequence found in the database. For the second search, within a window of +/- 700 amino acids around hits from the first search, at least 3 consecutive amino acids had to match. The difference between calculated and measured precursor mass was restricted to 1000 ppm. Tryptic cleavage sites (R or K) had to be present on both sides of the peptide sequence. These settings only allow us to find peptides with a minimal length of 8 amino acids; shorter peptides are always missed. Peptides that do not include a stretch of 5 or more correct amino acids are missed as well.
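
The three acceptance criteria named above (an exact stretch of amino acids, a precursor mass tolerance in ppm, and tryptic flanks) can be illustrated with small Python helpers. This is a sketch under stated assumptions and not the GPF implementation: all function names are hypothetical, the ppm deviation is taken relative to the measured mass, and protein termini are assumed to count as valid cleavage sites.

```python
def shares_exact_stretch(prediction: str, database_seq: str, k: int = 5) -> bool:
    """True if any k consecutive residues of the prediction occur in database_seq."""
    return any(
        prediction[i:i + k] in database_seq
        for i in range(len(prediction) - k + 1)
    )


def within_ppm(calculated: float, measured: float, tolerance_ppm: float = 1000.0) -> bool:
    """True if the mass deviation, relative to the measured mass, is within tolerance."""
    return abs(calculated - measured) / measured * 1e6 <= tolerance_ppm


def has_tryptic_flanks(database_seq: str, start: int, end: int) -> bool:
    """Check K/R before the peptide database_seq[start:end] and at its last residue.

    Assumption: the start or end of the database sequence is accepted as a flank.
    """
    before_ok = start == 0 or database_seq[start - 1] in "KR"
    after_ok = end == len(database_seq) or database_seq[end - 1] in "KR"
    return before_ok and after_ok


# Hypothetical database fragment containing a peptide discussed in the Results.
fragment = "MKSINANYTSVLVTAPEGKLL"
print(shares_exact_stretch("QWANYTSVLVTAPEGK", fragment))  # shares e.g. "NYTSV"
print(has_tryptic_flanks(fragment, 2, 19))                 # preceded by K, ends with K
```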

2.6 AutoMS

AutoMS is a software tool designed and programmed in C++ using Microsoft Visual Studio .NET. Where possible, ANSI C++ was used. It runs under most Microsoft Windows operating systems. AutoMS automates the programs used in this study. It enables batch processing of hundreds of tasks, which can easily be set up and, if required, individually customized. Some data filtering can be performed, and the significant results are reported in a plain text file that is easy to import into our database. All settings and options for the individual programs can be set in AutoMS as well. The settings used are described in the individual sections of the tools employed. File locations are adjusted to achieve a structured file repository. The code is robust enough to allow the software to run without interruption. GPF was run outside of the AutoMS environment. Since all information was contained in a specific directory structure, it was, however, no problem to integrate the results after processing on the cluster. Picking up further processing downstream of GPF did not pose any
problem due to the same fact. The software is still in its beta phase, but the executable is available upon request.

2.7 Confidence calculation for PEAKS-GPF-Sequest findings

We digested the six-frame translation of the Chlamydomonas reinhardtii genome with trypsin in silico. The resulting tryptic peptides were sorted into 1 Dalton sized bins, and the count of peptides per bin was stored. We then plotted the bins against the number of peptides in the bins (Supplementary Fig. 28 online). SigmaPlot was used to fit the mass range (300 Da to 4000 Da) that is usually best resolved in the mass spectrometer. Equation 1 describes the best fit.

$$M_{Of} = \frac{3.31343 \cdot 10^{6}}{x} + \frac{5.77642 \cdot 10^{9}}{x^{2}} - 902.428$$

Equation 1

$x$: Peptide mass in Dalton

$M_{Of}$: Frequency of the occurrence of a given peptide mass
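
The in silico digest and the binning that underlie this fit can be sketched in Python. The sketch makes several assumptions that the text does not state: cleavage after every K or R without missed cleavages or a proline rule, monoisotopic residue masses, and bins defined by truncating the mass to an integer. All names are hypothetical.

```python
from collections import Counter

# Standard monoisotopic residue masses in Dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide


def tryptic_peptides(protein):
    """Yield peptides obtained by cleaving after every K or R (assumed rule)."""
    current = []
    for residue in protein:
        current.append(residue)
        if residue in "KR":
            yield "".join(current)
            current = []
    if current:
        yield "".join(current)


def peptide_mass(peptide):
    """Sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def bin_counts(proteins):
    """Count tryptic peptides per 1 Dalton bin (bin index = truncated mass)."""
    counts = Counter()
    for protein in proteins:
        for peptide in tryptic_peptides(protein):
            counts[int(peptide_mass(peptide))] += 1
    return counts


print(list(tryptic_peptides("AKGRW")))
print(bin_counts(["AKAK"]))
```

The resulting bin counts are the kind of data to which a curve such as Equation 1 would be fitted.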

Given equation 1 and the complete number of tryptic peptides resulting from the in silico digest, we calculated the chance for a random occurrence of a peptide with mass x.

$$P_{E} = \frac{M_{Of}}{24177277}$$

Equation 2

$P_{E}$: Chance for random occurrence of a peptide with mass x

To assess the apparent database size for a peptide with mass x we first calculated the mean number of amino acids that can fit the peptide mass.

$$N_{AA} = \frac{x}{M_{AA}}$$

Equation 3

$N_{AA}$: Mean number of amino acids per peptide with mass x

$M_{AA}$: Mean amino acid mass (assumed to be 100 Da)

The apparent database size is thus given by the number of peptides within the mass window and the mean number of amino acids per peptide.

$$DB_{App} = M_{Of} \cdot N_{AA}$$

Equation 4

$DB_{App}$: Apparent database size

Given the apparent size of the database and the number of matching amino acids in the peptide (here 5, or 5 + 3 for intron-split peptides) we calculated the probability for a random occurrence.

$$P_{A} = \frac{DB_{App}}{20^{5}}$$

Equation 5

$P_{A}$: Random occurrence of a string of amino acids in the apparent database

To calculate the random occurrence of any peptide only dependent on the mass of the peptide we combined the two probabilities.

$$P = P_{A} \cdot P_{E}$$

Equation 6

Equation 6 can be substituted with the appropriate variables and then be simplified to yield equation 7.

$$P = \frac{1078}{x^{3}} + \frac{1.24}{x^{2}} + \frac{3.54 \cdot 10^{-4}}{x} + 2.63 \cdot 10^{-11} \, x + 1.93 \cdot 10^{-7}$$

Equation 7

This assessment of probability is only an approximation, but it appears accurate enough to conclude that any peptide found with the combination of PEAKS, GPF, and Sequest can be considered significant (P < 0.05). Measured peptide masses typically lie between 1000 and 3000 Da, corresponding to probabilities in the range of 3.2 × 10^-6 to 6.8 × 10^-7.

2.8 Database

The database was designed in Microsoft Access. Microsoft Visual Basic was used to program all necessary features. The aim of the database is to map sequences found by distinct methods back to a sequence database. The table-space and the relations can be seen in Supplementary Figure 27. We used the JGI gene models and the mitochondrion and plastid proteomes of Chlamydomonas reinhardtii as sequence information in our database. All Sequest, GPF, and PEAKS findings were mapped to this sequence repository, which is held in one table of the database. Automatic procedures were devised to extract significantly identified proteins from the database. Significantly identified proteins were those that either had more than one supporting peptide or had one supporting peptide that was in turn supported by multiple methods. In addition, the significance thresholds had to be met for each of these supporting peptides. The significant proteins were kept in experimental
context, meaning that only peptides from a certain band or from a number of related bands were combined to find significant proteins. The threshold for combining the peptide pools of multiple SDS-PAGE bands was +/- 5% of their molecular weight. Peptides that would be the single supporting source for a given protein were checked for sequence complexity. If there was less than 40% variability in the sequence, the result was discarded. A number of functions were designed to automatically retrieve more information on each of these proteins via the internet. All data leading to significant protein identifications can be found in the online supplement, where we included a zipped instance of an Access database for download.
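
The protein-level significance rule and the complexity check can be sketched in Python. This is an illustration, not the Access/Visual Basic implementation: the function names and the data layout are hypothetical, and "variability" is interpreted here as the number of distinct residues divided by the peptide length, which is an assumption because the text does not define the measure.

```python
def is_significant_protein(supporting_peptides: dict) -> bool:
    """Apply the rule: more than one peptide, or one peptide found by several methods.

    supporting_peptides maps each peptide sequence to the set of methods
    that identified it (each hit is assumed to have passed its own threshold).
    """
    if len(supporting_peptides) > 1:
        return True
    if len(supporting_peptides) == 1:
        (methods,) = supporting_peptides.values()
        return len(methods) > 1
    return False


def sequence_variability(peptide: str) -> float:
    """Assumed measure: distinct residues divided by peptide length."""
    return len(set(peptide)) / len(peptide)


def passes_complexity_check(peptide: str, minimum: float = 0.40) -> bool:
    """Discard single supporting peptides with less than 40% variability."""
    return sequence_variability(peptide) >= minimum


# Hypothetical evidence for one protein.
evidence = {"NYTSVLVTAPEGK": {"Sequest", "PEAKS/GPF/Sequest"}}
print(is_significant_protein(evidence), passes_complexity_check("NYTSVLVTAPEGK"))
```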

3. Results

It was our aim to integrate de novo amino acid sequence predictions and GPF into the standard computational processing workflow for mass spectrometric data analysis. To achieve this, de novo amino acid predictions were used as search strings in an error-tolerant genomic database search, performed by GPF, to detect intron-split and/or alternatively spliced peptides when deduced from genomic DNA. Using this platform we explored the thylakoid proteome of the green alga C. reinhardtii to identify proteins, to facilitate annotation of gene models, and to recognize proteins that may originate from alternative splicing. The thylakoid membranes were isolated from iron-sufficient and iron-deficient arginine auxotrophic cell wall less cells that were grown in the presence of either isotopically labeled 13C6 or unlabeled 12C6 arginine. The aim was to perform comparative quantitative proteomics to elucidate adaptation to iron deficiency (Naumann, Allmer, Hippler, manuscript in preparation). Labeled and unlabeled thylakoid proteins were mixed and separated by one-dimensional gel electrophoresis; protein bands were excised from the gel, digested in-gel with trypsin and analyzed by liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS). In total, 126 bands were analyzed from four independent SDS-PAGE fractionations. We introduced a new approach to computational processing, designed to gain maximum information from the mass spectrometric data gathered, which is synergistic with other approaches described earlier [42]. The complete workflow downstream of spectra acquisition is summarized in Figure 1. As depicted therein, six different tasks were performed on the spectra before the results were imported into a database. First, the spectra were extracted from the raw-file produced by the software of the mass spectrometer. Then Sequest searches were performed on several databases. The
significant results were extracted and stored in the database. Up to this point, the process was well established. In addition, we performed de novo amino acid sequence predictions using the PEAKS algorithm [14]. All recorded singly and doubly charged mass spectra were submitted to de novo amino acid analysis by PEAKS. The predictions were converted to queries for the GenomicPeptideFinder and searched against the six-frame translation of the genomic database of C. reinhardtii. To close the loop, the GPF predictions were then evaluated by Sequest against the original spectra. The process is illustrated with an example in Figure 2, which shows a recorded spectrum, its b- and y-ions, and the sequences as given by PEAKS and GPF, respectively. Although it deviated at the N- and C-terminus, the PEAKS prediction permitted identification of a peptide sequence with a significant Sequest Xcorr factor when searched against the translation of the genomic database with GPF. This result was also present in both the EST data and the gene models, which underlines the significance of this finding. Sequest and PEAKS employ distinct methods to analyze the data contained in the recorded mass spectra. The Sequest algorithm uses cross-correlation of measured mass spectra with spectra computationally derived from peptide sequences in a database. PEAKS obtains sequences from the mass spectra themselves, employing a de novo amino acid sequencing algorithm. Since the two methods are distinct, identical results are complementary and are considered to raise the confidence in a particular finding. Subsequently, the significant results were filtered and imported into a database. Many of these steps involved manual input and adjustments to options for the various processes. This proved very tedious and error-prone. Therefore, we devised a program named AutoMS that automated these tasks. Using AutoMS we
were able to make full use of de novo amino acid sequencing by PEAKS and error-tolerant database search with GPF in a high-throughput experiment. The use of a 256 processor cluster at the University of Pennsylvania (the LINIAC cluster) enabled faster processing of the de novo amino acid sequence predictions with GPF. In total, 435,475 de novo amino acid sequence predictions originated from the mass spectrometric analysis of the one-dimensional PAGE bands. These predictions were run against the six-frame translation of the genomic database and produced 76,996,817 GPF predictions. After validation of these predictions with Sequest, 1094 peptides could be imported into the database. Among these peptides, 698 were potentially split by an intron. The concerted action of Sequest and GPF allowed identification of 11,735 non-redundant peptide sequences. Less than 1% (0.76%) of the sequences identified by Sequest were directly predicted with the same sequence by PEAKS. Using GPF to map the PEAKS predictions back to the genome, and then comparing the sequences found by Sequest with those found by this combination, significantly increased the number of matches (about 3%; 3.23%). Non intron-split peptides identified by Sequest and by PEAKS/GPF/Sequest analysis were used for protein identification. Proteins that matched with at least two peptides identified by the Sequest algorithm were considered significant. We further considered every protein that was recognized by at least one PEAKS/GPF/Sequest peptide matching the Sequest significance criteria as a significant identification. Given the mass of the peptide within +/- 1 Dalton and an exact match of 5 amino acids, we calculated that any peptide found with the combination of PEAKS, GPF and Sequest can be considered significant with respect to the size of the Chlamydomonas genomic database (P < 0.05; see Materials and Methods).

351 of the non intron-split peptide sequences identified by both Sequest and GPF could be directly mapped to existing gene models, thus underlining the correctness of the gene model prediction in those areas. Interestingly, several lower abundant proteins were identified by a single peptide as the sole supporting evidence (see Table 1), such as a novel light-harvesting protein not described on protein level before (LhcbP1, C_20371), an ATP-sulfurylase (C_20033), a DegP type serine protease (C_1010043), the Stt7 serine/threonine protein kinase (C_1150050) [45], a hypothetical glutathione S-transferase-related protein (C_1470041), a hypothetical Rhodanese like protein (C_20358), a hypothetical protein containing a CreA domain (C_1520015), a FKBP-type peptidyl-prolyl cis-trans isomerase (EC 5.2.1.8) (C_1630014), a hypothetical protein upregulated under low carbon dioxide (C_530007), two other hypothetical proteins (C_250026, C_1450004), a putative lumen protein related to OEE3, PsbQ (C_180041), and others (see Supplementary Tables 1, 2, 3 and 4 online). Direct mapping to existing gene models was also possible for 24 peptides that were split by an intron when deduced from genomic DNA. The corresponding intron-exon structure of these gene models could thus be validated for the associated area, which facilitated gene model annotation. Among the recognized proteins, two were identified by an intron-split peptide as the sole supporting peptide hit: one peptide matched the NADH:ubiquinone oxidoreductase B17.2-like subunit (C_140183) (Supplementary Table 3 online) and another a putative chloroplast inner envelope protein (C_320089) (Table 1). Other intron-split peptides could be used to correct the corresponding gene models (see Supplementary Fig. 2 online). However, only a small fraction of the peptides that were predicted to have introns when translated from genomic DNA mapped directly to existing gene models. There are a number of reasons for
this: (i) the entire peptide or one part of the intron-split peptide could be located outside of any existing gene model; (ii) it could represent alternative splicing, so that both flanking sequences can be found in the predicted gene model; (iii) the gene model could be wrong; or (iv) the peptide represents a false positive identification. We indeed found numerous intron-split peptides that matched outside of existing gene models or that matched an existing model with only one part. In the latter case, several models were modified according to the intron-split peptides, as shown in the online supplement. In total, 90 models were altered and 26 of these are presented (Supplementary Figures 1-26 online). In some cases, one part of the intron-split peptide matched a tryptic peptide in a gene model that was not intron-split. In these cases, however, a peptide stemming from the same dta-file was also identified by the PEAKS/GPF/Sequest search as a non-intron peptide. In the following example, the intron-split peptide QWA-NYTSVLVTAPEGK (Xcorr 4.42, doubly charged) matched with the sequence NYTSVLVTAPEGK, right of the intron split, to the gene model for peptidyl-prolyl cis-trans isomerase (C_90033). A corresponding non-intron peptide, SINANYTSVLVTAPEGK (Xcorr 4.45, doubly charged), matched the model perfectly. The two peptides differed in the sequences QWA and SINA. The three amino acid sequence QWA corresponded to the left part of the intron-split peptide and had almost exactly the same molecular mass as the four amino acid sequence SINA, so that both were recognized by the algorithm with significant Xcorr scores. The fact that the two Xcorr factors were so similar indicates that the N-termini of the peptides were not well resolved in the MS/MS spectra, so that misinterpretation by de novo amino acid sequencing is not surprising.
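
The near-identical mass of the two N-terminal stretches can be checked with a few lines of Python. The sketch uses standard monoisotopic residue masses, which is an assumption about the mass type; the names are hypothetical.

```python
# Standard monoisotopic residue masses in Dalton for the residues involved.
RESIDUE_MASS = {
    "Q": 128.05858, "W": 186.07931, "A": 71.03711,
    "S": 87.03203, "I": 113.08406, "N": 114.04293,
}


def residue_sum(sequence: str) -> float:
    """Sum of residue masses for a stretch of amino acids."""
    return sum(RESIDUE_MASS[residue] for residue in sequence)


qwa = residue_sum("QWA")
sina = residue_sum("SINA")
difference = abs(qwa - sina)
print(f"QWA: {qwa:.5f} Da, SINA: {sina:.5f} Da, difference: {difference:.5f} Da")
```

With these values the two stretches differ by roughly 0.02 Da, far less than the 0.5 Dalton fragment ion tolerance, so a spectrum with poorly resolved N-terminal fragments cannot distinguish them.
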
In another example, the right part of the intron-split peptide EEIGAEDGAGPISWADLIVLAAK (Xcorr 4.79, doubly charged) matched the gene model for L-ascorbate peroxidase (C_1860009). The corresponding
tryptic peptide in L-ascorbate peroxidase was the non-intron split sequence IDAAGAEDGAGPISWADLIVLAAK. Interestingly, the non-matching part IDAA had nearly the same mass as EEI (the two differ by about 1 Da), so that the masses of the PEAKS/GPF/Sequest and the L-ascorbate peroxidase peptides were nearly identical and both peptides were recognized by the algorithm. In principle, the latter peptide should have been identified as a non-intron-split peptide by the PEAKS/GPF/Sequest search, but it was not. In both examples, the non-intron split peptide sequences as well as the gene models were supported by EST data, whereas the intron-split peptides and the putatively altered models were not, indicating that these intron-split peptide identifications were possibly artifacts. Overall, we found about 151 peptides where at least eight amino acids of one part of the intron-split peptide could be matched within at least one gene model. Besides correction of gene models, the analysis of the mass spectrometric data also enabled identification of several GPF peptides that potentially originated from alternative splicing. Figure 3A shows the MS/MS spectrum of the peptide VNGGPAGEGLDPL-ADDPDTFAELK (Xcorr 3.64). Both parts of the peptide mapped to the gene models for Lhcbm3, 4, 8 and 9 (C_2050001, C_1460037, C_1460035, C_1460005). However, in the gene models a sequence of 11 amino acids was present between the two parts of the peptide (Figure 3B). Therefore, this intervening amino acid sequence would have to be excised to yield the PEAKS/GPF/Sequest peptide. However, the gene models for the different Lhcbm proteins are supported by EST data [46, 47] and mass spectrometric data [48], whereas a version harboring the PEAKS/GPF/Sequest peptide amino acid sequence is not. On the other hand, the mass spectrometric data indicate that a version of an Lhcbm gene product exists that harbors the fused amino acid sequence (see Figure 3B), which might represent a novel alternatively spliced form. It is well possible that the novel version of Lhcbm is
expressed under iron deficiency. The non-spliced peptide VNGGPAGEGLDPLYPGESFDPLGLADDPDTFAELK is part of an amino acid sequence motif in Lhcb proteins (L18) that, along with a hydrophobic recognition element, facilitates posttranslational binding to the signal recognition particle in chloroplasts and is required for translocation of the protein to the thylakoid membrane [49]. It should be noted that all Chlamydomonas Lhcbm proteins so far present in the genome possess this motif [47]. However, there are Lhcb proteins that are missing a large part of this motif, like the LHCII-type I from Dunaliella tertiolecta [50]. Interestingly, the shorter L18 sequence in this protein, which is DPLGLADDPDTFAELK, is almost identical to the DPLADDPDTFAELK sequence that is present in the alternatively spliced Chlamydomonas Lhcbm protein. Recent work emphasized the importance of posttranscriptional control for Lhcb expression in Chlamydomonas [51, 52]. Notably, we also found evidence for alternative splicing of Lhcb4, although the identified peptide could also be explained by non-tryptic cleavage (see Supplementary Fig. 3 online). Alternative splicing thus appears to be a novel and so far undiscovered mode of control of light-harvesting protein expression in Chlamydomonas. Further experiments will be necessary to validate this suggestion. A total of 278 proteins were identified, of which 45% are potentially chloroplast-located, 27% are expected to be located in the mitochondria and 28% have other locations (see Supplementary Tables 1-4 and database online). Within this dataset, 26 gene models were altered in accordance with GPF and Sequest analysis (see Supplementary Figures 1-26). The co-migration of mitochondrial inner membrane proteins with purified thylakoid membrane preparations has been shown earlier [53] and is probably due to the fact that mitochondria and the single
Chlamydomonas chloroplast are tightly connected to each other via the cytoskeleton. This explains the high percentage of mitochondrial proteins in our dataset. We used the JGI Chlamydomonas genome annotation site as well as the Plastid Proteome Database [54] to help localize the identified proteins. In addition, we employed the ChloroP [55], Predator [56] and TargetP [35] algorithms for identification of putative transit and signal peptides, respectively. Proteins that had chloroplast transit peptides, as recognized by ChloroP, were automatically sorted into the chloroplast. Proteins that were identified by Predator as having a signal peptide were assigned to the mitochondrion. For functional annotation we used the JGI Chlamydomonas genome annotation site and the MapMan functional categories [57]. Within the chloroplast dataset, 58 of the identified proteins are involved in thylakoid (cyclic) electron transfer, light-harvesting and ATP synthesis (Supplementary Tables 1 and 2 online). Seven of these proteins are not well characterized on protein level. Among them are two putative oxygen enhancer proteins, one related to PsbP (C_490070) and the other related to PsbQ (C_180041), LhcSR1 and LhcSR3 (C_13190001), a putative new light-harvesting protein, LhcbP1 (C_20371), and an early light-induced protein (C_770034). The latter four proteins are putative pigment-binding proteins, but their exact function and location are unknown. For LhcSR1 and LhcSR3, it has been shown that mRNA expression levels were induced under various stress conditions [47]. As the second largest class, we identified 25 hypothetical proteins (Supplementary Table 2 online). Such a high proportion of hypothetical proteins among all proteins identified was also described for thylakoid membrane proteins from Arabidopsis [36, 37, 40]. One of these hypothetical proteins was encoded in the chloroplast (orf1995).
Deletion experiments with orf1995 in the chloroplast genome of Chlamydomonas as well as in tobacco have demonstrated that this gene encodes an essential protein. Fourteen of the
nuclear-encoded hypothetical proteins were already annotated in the PPDB database or have close orthologs in vascular plants. Of these, six gene products, encoded by gene models C_320089, C_1470041, C_490057, C_80027, C_160173* (altered model, see Supplementary Fig. 18 online) and C_530007, have not been described on protein level before. For the latter gene model, C_530007, mRNA expression levels appeared to be induced under low CO2 in C. reinhardtii [58]. The gene product of gene model C_490057 is a notchless-like WD repeat-containing protein. A common function of all WD-repeat proteins is the coordination of multi-protein complex assemblies, where the WD-repeat units serve as a rigid scaffold for protein interactions [59, 60]. C_80027 is a putative aspartate aminotransferase. Seven of the nuclear-encoded hypothetical proteins are either Chlamydomonas-specific or have close orthologs in cyanobacteria or bacteria. Hypothetical proteins encoded by gene models C_1350028, C_440066 and C_540056 possess no known domain features and are not well conserved in other organisms. The gene product of gene model C_1520015 shares a CreA domain with bacterial proteins, the function of which is unknown. The function of a gene product that harbors a proline-rich domain, encoded by models C_200177 and C_200178, is also unknown. An S-isoprenylcysteine O-methyltransferase related enzyme (C_670031) and a conserved hypothetical protein (C_1450004) were recognized, both of which have their nearest relative in the cyanobacterium Prochlorococcus marinus, strains CCMP1375 [61] and MIT9313 [62], respectively. Another hypothetical protein, encoded by the altered gene model C_740050* (see Supplementary Fig. 1 online), contains a kinase domain and has putative orthologs in human. This protein possesses more than 1000 amino acids and is probably only found as a fragment in our SDS-PAGE system. The third largest class of proteins is represented by the category redox proteins, oxidative
defense and stress response (7). A protein disulfide isomerase, RB60 (C_390061) [63], was identified with eleven PEAKS/GPF/Sequest peptides and therefore seems to be an abundant thylakoid membrane protein, comparable in abundance to the proteins forming the photosynthetic apparatus (see Supplementary Table 1 online). Interestingly, four different peptidyl-prolyl cis-trans isomerases were identified (C_30247, C_30248, C_90033, C_1630014), as well as a putative L-ascorbate peroxidase (C_1860009) and a 2-Cys peroxiredoxin (C_200197). Five enzymes of terpenoid and tetrapyrrole metabolism were identified: protoporphyrinogen oxidase (C_330078), phytoene desaturase (C_490019), magnesium chelatase H subunit (C_570057), geranylgeranyl reductase (C_180047) and 3,8-divinyl protochlorophyllide a 8-vinyl reductase (C_1330031). Five proteins (C_270019, C_2350003, C_590080, C_1800002 and C_2200010) that are implicated in the transport of ions and small organic molecules are probably all located in the inner envelope membrane. As described for Arabidopsis [36], the proteins potentially targeted to the thylakoid membrane were not enriched in proteins related to DNA organization, transcription and translation. In this category we identified a subunit of the RNA polymerase (rpoC2 [64]) and a putative helicase-like transcription factor-like protein (C_12160001). There is evidence that the protein disulfide isomerase RB60 discussed above acts as a regulator of chloroplast translational activation [63]. The other proteins identified could be catalogued in the following categories: protein fate (C_1620016, C_10225, C_1010043, C_270022), Calvin cycle (RbcL, C_280107), N, S, amino acid and nucleotide metabolism (C_20057*, C_1390010), development and signaling function (C_1150050), hormone and lipid metabolism (C_16470001*) and other functions (C_460094, C_1330014, C_670059, C_30013,
Publications – Publication 4

100

C_170036). Remarkably we also found Chlamyopsin (C_3230005) in this dataset [65]. For the proteins potentially localized to the mitochondria (Supplementary Table 3 online) the majority of them are involved in respiratory electron transfer, ATP synthesis (31). Herein we identified thirteen subunits of the NADH:ubiquinone oxidoreductase (complex 1), which were also recognized after purification of the complex by blue-native PAGE combined with SDS-PAGE and subsequent mass spectrometric analysis [66]. Interestingly for complex II we identified a putative succinate dehydrogenase (C_240093) as well as a putative succinate dehydrogenase cytochrome b560 subunit (C_350099), for complex III subunit 6 (C_1370004, C_20033) and for complex IV a small subunit COX90 (C_810006), all of which have not been described on protein level before [67]. For complex IV we further identified subunit 5b, for complex III subunits Core1, Core2 and RIP1 and for complex V ten subunits were additionally recognized [68, 69]. Again, hypothetical proteins represented the second largest class (20) followed by proteins engaged in Krebs cycle, photorespiration (6), protein fate (4) and other proteins (Supplementary Table 3 online). From the results described above it is obvious that our data mining strategy allowed identification of numerous lower abundant proteins, which were only identified with one supporting peptide that was also recognized in parallel by PEAKS, GPF and Sequest analysis, therefore rendering the identification significant. We propose that this strategy may allow stringent judgment of proteins that are identified by only a single peptide (one-hit wonders, [70])
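The single-peptide criterion described above can be illustrated with a short sketch. This is not the pipeline used in this work: the function name, the data layout (one mapping from protein identifier to peptide set per search tool) and the example identifiers are assumptions made purely for illustration.

```python
from typing import Dict, Set

# Hypothetical layout: for each tool, protein identifier -> set of peptide sequences.
Hits = Dict[str, Set[str]]


def consensus_single_hits(peaks: Hits, gpf: Hits, sequest: Hits) -> Dict[str, str]:
    """Return proteins supported by exactly one peptide that all three tools report."""
    accepted: Dict[str, str] = {}
    # Only proteins reported by every tool can qualify.
    for protein in peaks.keys() & gpf.keys() & sequest.keys():
        shared = peaks[protein] & gpf[protein] & sequest[protein]
        all_peptides = peaks[protein] | gpf[protein] | sequest[protein]
        # A "one-hit wonder" is kept only if its single supporting peptide
        # was found independently by each of the three tools.
        if len(all_peptides) == 1 and len(shared) == 1:
            accepted[protein] = next(iter(shared))
    return accepted


# Made-up example data (not taken from the study).
peaks_hits = {"protA": {"LSVEELK"}, "protB": {"AAGK", "TTPR"}}
gpf_hits = {"protA": {"LSVEELK"}, "protB": {"AAGK"}}
sequest_hits = {"protA": {"LSVEELK"}, "protC": {"GGR"}}
single_hits = consensus_single_hits(peaks_hits, gpf_hits, sequest_hits)
```

In this toy input only protA is retained: protB is missing from one tool, and protC is reported by a single tool only.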

4. Discussion

A new high-throughput computational strategy was established that improves genomic data mining from mass spectrometric experiments. To the best of our knowledge, this is the first study that takes advantage of de novo amino acid sequencing and error-tolerant database searching in a high-throughput mass spectrometric experiment, combining de novo sequencing from MS/MS spectra with GPF running on a 256-processor cluster. We propose that our data-mining strategy will be of help to explore nuclear gene structures and identify alternative splicing in eukaryotic organisms with complex genomes. Although our primary goal was to identify intron-split and/or alternatively spliced peptides, we realized that the identification of non-intron-split peptides by PEAKS/GPF/Sequest searches significantly aided protein identification in the course of the experiment. 351 of the non-intron-split peptide sequences could be directly mapped to existing gene models, thereby confirming the precision of the gene model prediction in those areas. We found about 151 intron-split peptides that matched directly to a gene model with at least eight amino acids of at least one of their two segments. Overall, the data resulted in the modification of 26 JGI gene models (see Supplementary Figures 1-26 online), some representing gene model corrections and others pointing to alternative splicing (see Supplementary Methods online for how gene models were generally modified). The larger fraction of the intron-split peptides could not be matched with existing models. Several of these peptides might be false-positive identifications; however, others might reveal new gene products. In future experiments, a graphical platform will be required to visualize proteomic data together with the genomic DNA sequence, EST sequences and other DNA-derived sequence information. Besides gene model prediction based on DNA-driven algorithms, novel and unbiased algorithms are required that weigh all available information, including proteomic data, to predict gene models from genomic DNA. Existing database and visualization tools such as AceDB and Artemis may be of help in this context [71]. The discovery of alternative splicing, along with gene model corrections and modifications, is biologically valid and novel information that was made possible by genomic data mining employing mass spectrometric data. To improve genomic data mining by mass spectrometry, the efficiency and speed of GPF should be increased while the number of predictions is decreased, so that high-throughput processing on a single computer becomes feasible. The improved version will be used to assess the value of a number of currently available de novo prediction software packages. As this is a developing field, it will be of use to compare and contrast the available methods.
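The matching criterion quoted in the discussion for intron-split peptides (at least eight amino acids of at least one of the two segments matching a gene model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the GPF implementation: the function name, the representation of a peptide as a pair of segment strings, exact substring matching and the example sequences are all hypothetical.

```python
from typing import Tuple

MIN_SEGMENT_LENGTH = 8  # threshold quoted in the text


def matches_gene_model(
    segments: Tuple[str, str],
    model_protein: str,
    min_len: int = MIN_SEGMENT_LENGTH,
) -> bool:
    """True if at least one segment is long enough and occurs in the model sequence."""
    # Assumption: a "match" means an exact substring match against the
    # protein sequence translated from the gene model.
    return any(len(seg) >= min_len and seg in model_protein for seg in segments)


# Made-up protein sequence standing in for a translated gene model.
model = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

long_segment_match = matches_gene_model(("AYIAKQRQ", "WWW"), model)
short_segments_only = matches_gene_model(("AYIAK", "LGLIE"), model)
no_occurrence = matches_gene_model(("AAAAAAAA", "CCCCCCCC"), model)
```

Here only the first peptide qualifies: its left segment has eight residues and occurs in the model, whereas the second peptide has segments that are too short and the third has segments that do not occur in the model at all.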

Publications – Publication 4

103

5. References

[1] Eng, J. K., McCormack, A. L. and Yates, J. R., J Am Soc Mass Spectr 1994, 5, 976-989.

[2] Perkins, D. N., Pappin, D. J., Creasy, D. M. and Cottrell, J. S., Electrophoresis 1999, 20, 3551-67.

[3] Creasy, D. M. and Cottrell, J. S., Proteomics 2002, 2, 1426-34.

[4] Field, H. I., Fenyo, D. and Beavis, R. C., Proteomics 2002, 2, 36-47.

[5] Tabb, D. L., Saraf, A. and Yates, J. R., 3rd, Anal Chem 2003, 75, 6415-21.

[6] Li, D., Fu, Y., Sun, R., Ling, C. X., et al., Bioinformatics 2005, 21, 3049-50.

[7] Shadforth, I., Dunkley, T., Lilley, K., Crowther, D. and Bessant, C., Rapid Commun Mass Spectrom 2005, 19, 3363-3368.

[8] Zhang, N., Li, X. J., Ye, M., Pan, S., et al., Proteomics 2005, 5, 4096-106.

[9] Colinge, J., Masselot, A., Giron, M., Dessingy, T. and Magnin, J., Proteomics 2003, 3, 1454-63.

[10] Eddes, J. S., Kapp, E. A., Frecklington, D. F., Connolly, L. M., et al., Proteomics 2002, 2, 1097-103.

[11] Keller, A., Nesvizhskii, A. I., Kolker, E. and Aebersold, R., Anal Chem 2002, 74, 5383-92.

[12] Magnin, J., Masselot, A., Menzel, C. and Colinge, J., J Proteome Res 2004, 3, 55-60.

[13] Zhang, W. and Chait, B. T., Anal Chem 2000, 72, 2482-9.

[14] Ma, B., Zhang, K., Hendrie, C., Liang, C., et al., Rapid Commun Mass Spectrom 2003, 17, 2337-42.

[15] Fernandez-de-Cossio, J., Gonzalez, J., Satomi, Y., Shima, T., et al., Electrophoresis 2000, 21, 1694-9.

[16] Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E. and Pevzner, P. A., J Comput Biol 1999, 6, 327-42.

[17] Chen, T., Kao, M. Y., Tepel, M., Rush, J. and Church, G. M., J Comput Biol 2001, 8, 325-37.

[18] Bafna, V. and Edwards, N., 2003.

[19] Taylor, J. A. and Johnson, R. S., Anal Chem 2001, 73, 2594-604.

[20] Spengler, B., J Am Soc Mass Spectrom 2004, 15, 703-14.

[21] Yan, B., Pan, C., Olman, V. N., Hettich, R. L. and Xu, Y., Bioinformatics 2005, 21, 563-74.

[22] Sun, W., Li, F., Wang, J., Zheng, D. and Gao, Y., Mol Cell Proteomics 2004, 3, 1194-9.

[23] Han, Y., Ma, B. and Zhang, K., J Bioinform Comput Biol 2005, 3, 697-716.

[24] Mann, M. and Pandey, A., Trends Biochem Sci 2001, 26, 54-61.

[25] Allmer, J., Markert, C., Stauber, E. J. and Hippler, M., FEBS Lett 2004, 562, 202-6.

[26] Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., et al., Nature 2002, 420, 563-73.

[27] Strausberg, R. L., Feingold, E. A., Grouse, L. H., Derge, J. G., et al., Proc Natl Acad Sci U S A 2002, 99, 16899-903.

[28] Seki, M., Narusaka, M., Kamiya, A., Ishida, J., et al., Science 2002, 296, 1415.

[29] Reboul, J., Vaglio, P., Rual, J. F., Lamesch, P., et al., Nat Genet 2003, 34, 35-41.

[30] Choudhary, J. S., Blackstock, W. P., Creasy, D. M. and Cottrell, J. S., Trends Biotechnol 2001, 19, S17-22.

[31] Modrek, B. and Lee, C., Nat Genet 2002, 30, 13-9.

[32] Hippler, M., Redding, K. and Rochaix, J. D., Biochim Biophys Acta 1998, 1367, 1-62.

[33] Snell, W. J., Pan, J. and Wang, Q., Cell 2004, 117, 693-7.

[34] Stauber, E. J. and Hippler, M., Plant Physiol Biochem 2004, 42, 989-1001.

[35] Emanuelsson, O., Nielsen, H., Brunak, S. and von Heijne, G., J Mol Biol 2000, 300, 1005-16.

[36] Friso, G., Giacomelli, L., Ytterberg, A. J., Peltier, J. B., et al., Plant Cell 2004, 16, 478-99.

[37] Peltier, J. B., Ytterberg, A. J., Sun, Q. and van Wijk, K. J., J Biol Chem 2004, 279, 49367-83.

[38] Ferro, M., Salvi, D., Brugiere, S., Miras, S., et al., Mol Cell Proteomics 2003, 2, 325-45.

[39] Froehlich, J. E., Wilkerson, C. G., Ray, W. K., McAndrew, R. S., et al., J Proteome Res 2003, 2, 413-25.

[40] Kleffmann, T., Russenberger, D., von Zychlinski, A., Christopher, W., et al., Curr Biol 2004, 14, 354-62.

[41] Naumann, B., Stauber, E. J., Busch, A., Sommer, F. and Hippler, M., J Biol Chem 2005, 280, 20431-41.

[42] Elias, J. E., Haas, W., Faherty, B. K. and Gygi, S. P., Nat Methods 2005, 2, 667-75.

[43] Grossmann, J., Roos, F. F., Cieliebak, M., Liptak, Z., et al., J Proteome Res 2005, 4, 1768-74.

[44] Kapp, E. A., Schutz, F., Connolly, L. M., Chakel, J. A., et al., Proteomics 2005, 5, 3475-90.

[45] Depege, N., Bellafiore, S. and Rochaix, J. D., Science 2003, 299, 1572-5.

[46] Teramoto, H., Ono, T. and Minagawa, J., Plant Cell Physiol 2001, 42, 849-56.

[47] Elrad, D. and Grossman, A. R., Curr Genet 2004, 45, 61-75.

[48] Stauber, E. J., Fink, A., Markert, C., Kruse, O., et al., Eukaryot Cell 2003, 2, 978-94.

[49] DeLille, J., Peterson, E. C., Johnson, T., Moore, M., et al., Proc Natl Acad Sci U S A 2000, 97, 1926-31.

[50] LaRoche, J., Bennett, J. and Falkowski, P. G., Gene 1990, 95, 165-71.

[51] Durnford, D. G., Price, J. A., McKim, S. M. and Sarchfield, M. L., Physiol Plantarum 2003, 118, 193-205.

[52] Mussgnug, J. H., Wobbe, L., Elles, I., Claus, C., et al., Plant Cell 2005, 17, 3409-21.

[53] Atteia, A., de Vitry, C., Pierre, Y. and Popot, J. L., J Biol Chem 1992, 267, 226-234.

[54] Sun, Q., Emanuelsson, O. and van Wijk, K. J., Plant Physiol 2004, 135, 723-34.

[55] Emanuelsson, O., Nielsen, H. and von Heijne, G., Protein Sci 1999, 8, 978-84.

[56] Small, I., Peeters, N., Legeai, F. and Lurin, C., Proteomics 2004, 4, 1581-90.

[57] Thimm, O., Blasing, O., Gibon, Y., Nagel, A., et al., Plant J 2004, 37, 914-39.

[58] Burow, M. D., Chen, Z. Y., Mouton, T. M. and Moroney, J. V., Plant Mol Biol 1996, 31, 443-8.

[59] Li, D. and Roberts, R., Cell Mol Life Sci 2001, 58, 2085-97.

[60] Smith, T. F., Gaitatzes, C., Saxena, K. and Neer, E. J., Trends Biochem Sci 1999, 24, 181-5.

[61] Dufresne, A., Salanoubat, M., Partensky, F., Artiguenave, F., et al., Proc Natl Acad Sci U S A 2003, 100, 10020-5.

[62] Rocap, G., Larimer, F. W., Lamerdin, J., Malfatti, S., et al., Nature 2003, 424, 1042-7.

[63] Kim, J. and Mayfield, S. P., Science 1997, 278, 1954-7.

[64] Maul, J. E., Lilly, J. W., Cui, L., dePamphilis, C. W., et al., Plant Cell 2002, 14, 2659-79.

[65] Deininger, W., Kroger, P., Hegemann, U., Lottspeich, F. and Hegemann, P., Embo J 1995, 14, 5849-58.

[66] Cardol, P., Vanrobaeys, F., Devreese, B., Van Beeumen, J., et al., Biochim Biophys Acta 2004, 1658, 212-24.

[67] Cardol, P., Gonzalez-Halphen, D., Reyes-Prieto, A., Baurain, D., et al., Plant Physiol 2005, 137, 447-59.

[68] van Lis, R., Atteia, A., Mendoza-Hernandez, G. and Gonzalez-Halphen, D., Plant Physiol 2003, 132, 318-30.

[69] Funes, S., Davidson, E., Claros, M. G., van Lis, R., et al., J Biol Chem 2002, 277, 6051-8.

[70] Veenstra, T. D., Conrads, T. P. and Issaq, H. J., Electrophoresis 2004, 25, 1278-9.

[71] Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., et al., Bioinformatics 2000, 16, 944-5.

Figure 1: Computational processing of the MS/MS spectra acquired by mass spectrometry. Certain data associated with each process, such as processing time, are presented in the boxes above, each box representing a distinct process. Processing times were calculated for one PC unless indicated otherwise. Most triply charged dta files were not submitted to de novo prediction. The GPF core is the same in both the PC and the UNIX distribution. Most GPF processing was done on the cluster.

Figure 2: The figure shows the recorded MS/MS spectrum (bnT_3_063004.1322.1322.2.dta) with labeled b-ions and y-ions. Both the most significant PEAKS prediction and the GPF prediction are displayed in the graph as amino acid sequences. They are centered amidst their corresponding b-ion peaks. Of the PEAKS predictions, the first two led to the same GPF result. The third and fourth predictions did not lead to a significant GPF result. The fifth prediction was not tested since it was filtered due to low confidence (
