Methods, tools and current perspectives in ...

2 downloads 0 Views 3MB Size Report
Apr 29, 2017 - Samuel H. Payne5, David Fenyö6,7,#, Bing Zhang3,4,#, D. R. Mani2,# .... The term proteogenomics came into use following a ...... Sanders, W. S., Wang, N., Bridges, S. M., Malone, B. M., Dandass, Y. S., McCarthy, F.
MCP Papers in Press. Published on April 29, 2017 as Manuscript MR117.000024

Methods, tools and current perspectives in proteogenomics

Kelly V. Ruggles1,*, Karsten Krug2,*, Xiaojing Wang3,4,*, Karl R. Clauser2, Jing Wang3,4, Samuel H. Payne5, David Fenyö6,7,#, Bing Zhang3,4,#, D. R. Mani2,#

1

Department of Medicine, New York University School of Medicine, New York, New York 10016

2

The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142

3

Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030

4

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas

77030 5

Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington

99354 6

Department of Biochemistry and Molecular Pharmacology, New York University School of

Medicine, New York, New York 10016 7

Institute for Systems Genetics, New York University School of Medicine, New York, New York

10016

*Co-first authors #

Co-senior authors, to whom correspondence should be addressed. D. R. Mani, [email protected], +1-617-714-7483 Bing Zhang, [email protected], +1-713-798-1443 David Fenyo, [email protected], +1-212-263-2216

1 Copyright 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

Abbreviations DNA

Deoxyribonucleic Acid

RNA

Ribonucleic Acid

MS

Mass Spectrometry

MS/MS

Tandem Mass Spectrometry

EST

Expressed Sequence Tag

ORF

Open Reading Frame

SAAV

Single Amino Acid Variant

TCGA

The Cancer Genome Atlas

CPTAC

Clinical Proteomic Tumor Analysis Consortium

FDR

False Discovery Rate

SNP

Single Nucleotide Polymorphism

PTM

Post-Translational Modification

PSM

Peptide Spectrum Match

RPPA

Reverse-Phase Protein Array

GSEA

Gene Set Enrichment Analysis

ssGSEA

Single-sample Gene Set Enrichment Analysis

SNV

Single Nucleotide Variant

nsSNV

Non-synonymous Single Nucleotide Variant

CNA  

Copy  Number  Aberration  

2

Table of Contents

1. Sequence-centric proteogenomics ……………………………………………… 6 1.1 Proteomics aiding genome annotation ………………………………... 7 1.2 Personalized protein sequence databases …………………………… 9 1.3 Sequencing of antibodies ………………………………………………. 10 1.4 Viral infections and activations of transposable elements ………….. 12 1.5 Metagenomics …………………………………………………………… 13 2. Analysis of proteogenomic relationships ……………………………………….. 14 2.1 mRNA-protein correlation ………………………………………………..16 2.2 Genetic control of mRNA and protein abundance …………………… 18 2.3 Relating mutations to PTM and signaling …………………………….. 21 3. Integrative modeling of proteogenomic data …………………………………… 21 3.1 Unsupervised clustering ………………………………………………… 21 3.2 Predictive modeling ……………………………………………………… 24 3.3 Pathway and network modeling ………………………………………... 26 4. Data sharing and visualization …………………………………………………… 28 4.1 Genome-based data sharing and visualization ………………………. 28 4.2 Network-based data sharing and visualization ……………………….. 31 5. Conclusion …………………………………………………………………………. 33

3

Summary With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e., the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic

sequencing

data.

More

recently,

integrative

analysis

of

quantitative

measurements from genomic and proteomic studies have identified novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches and in this article, we systematically classify published methods and tools into four major categories, (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.

4

The last decade has witnessed the rapid emergence of proteogenomics, a new research field at the interface of genomics and proteomics. The term proteogenomics came into use following a publication by George Church’s group in 2004 describing a proteogenomic mapping technique which harnessed proteomics data to improve genome annotation of Mycoplasma pneumonia (1). The reach of proteogenomics has since expanded with technological advancements enabling rapid and economical high-throughput DNA and RNA sequencing and deep mass spectrometry (MS)-based proteomics. These advancements have proved particularly useful for integrating nucleotide sequencing and MS data from the same sample, where genomic sequencing data can be used to improve protein identification through comprehensive protein sequence database construction. Proteomic data can then be used to demonstrate the validity and functional relevance of novel findings based on large scale RNA and DNA sequencing projects, including coding sequence variants and novel coding transcripts. In addition to sequence-centric proteogenomic data integration, combined quantitative analysis from genomic and proteomic studies have also been used to provide novel insights into multi-level gene expression regulation (2–13), signaling networks (14–17), disease subtypes (10, 12, 13), and clinical prediction (18–20). In this review, we subscribe to an expansive view of proteogenomics, encompassing all areas of proteomic and genomic integrative data analysis and cover the range of tools developed to tackle the associated challenges.

To complement already published review papers that focus on specific sub-domains of the broad proteogenomics research area (21–24), we systematically classified existing methods and tools for various types of integrative proteogenomic studies into four major sections. Section 1 describes aspects of sequence-centric proteogenomics and the combined use of genomic and proteomic data to augment gene or protein annotation (Fig. 1). Section 2 explores relationships between genomic and proteomic data using correlation, with application to deciphering the effect of mutations on signaling (Fig. 2). Section 3 summarizes integrative

5

modeling and analysis of proteogenomic data using statistical and machine learning approaches (Fig. 3). Section 4 discusses genome (Fig. 4) and network visualization (Fig. 5), along with challenges in data sharing. All four sections of the review assume tandem MS (MS/MS) as the core proteomics technology for generating peptide sequence data.

1. Sequence-centric proteogenomics In this section we review several areas of sequence-centric proteogenomics. This includes the integrative analysis of genomic and proteomic data for exome annotation in the form of gene discovery and gene model refinement (Section 1.1); protein level detection of single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions in relation to a reference genome sequence (Section 1.2); the application of proteomic sequencing to characterize antibodies (Section 1.3); studying the effects of viral infections and transposons on gene expression in eukaryotic organisms (Section 1.4); and applications of proteogenomics to metaproteomic investigation (Section 1.5).

A concept integral to all five topics in this section is the importance of an inclusive and highquality protein sequence database for peptide identification. In a typical proteomic experiment, peptide MS/MS spectra are interpreted using a database search algorithm that matches and scores the similarity of each experimental spectrum against model spectra constructed from peptide sequences contained in a user supplied protein sequence database (25). This strategy is used, in part, because the fragmentation efficiency of current MS/MS instrumentation is unable to consistently yield spectra from which complete, unambiguous sequences can be interpreted de novo (the current state of automated de novo peptide interpretation have been reviewed elsewhere (26, 27)). To address this, researchers use a protein sequence database, ideally containing all protein sequences one expects to be present in the sample, with minimal

6

irrelevant sequences to reduce false spectral matches and search time. Limiting the number of candidate sequences in the form of a sequence database enables the sequence ambiguity present in a spectrum to be overcome, resulting in high confidence peptide spectrum matches (PSMs) (28).

1.1 Proteomics aiding genome annotation The use of MS-based proteome characterization to aid genome annotation has been widely exploited in various organisms and has been previously reviewed by several groups (22, 29, 30). Here, we highlight some of the pioneering studies using integrative analysis of genomics and proteomics data for genome (re)annotation (See related review (31)).

Early studies integrating proteomic and genomic data date back to before the genome sequencing revolution of the early 21st century. Motivated by the lack of comprehensive and complete protein sequence databases and the emerging availability of nucleotide sequence data in the form of expressed sequence tags (ESTs), researchers interrogated peptide mass spectra using databases obtained by in-silico translation of unassembled ESTs. In 1995 Yates et al. (32) demonstrated the use of nucleotide sequences translated into all six reading frames of amino acid sequence (six-frame translation) to identify mass spectra of unmodified and phosphorylated peptides from human, bovine, E. coli, and S. cerevisiae proteins. The intrinsic ability of searching peptide mass spectra against genomic sequence databases to identify novel, unannotated genes, was employed shortly after by several groups (33–35). Choudhary et al. (36) used information from the recently released human genome draft in 2001 to query all 23 human chromosomes with tandem mass spectra and compare the results to EST database searches. From this, they concluded that MS/MS searching of genomic DNA databases were of limited utility, as the presence of introns in the database prohibited matching exon-spanning peptides and prevented identification in roughly one quarter of the spectra. Additionally, the

7

consensus sequence chosen for the reference genome also prevented the identification of individual single amino acid variants (SAAVs) with EST evidence. These limitations have since been addressed through the incorporation of sample-specific or species-specific alternative splicing and SNPs into the protein sequence database (see Section 1.2)

In 2004, Jaffe et al. introduced the concept of a proteogenomic map as a complementary method for genome annotation, which used evidence of protein expression to predict ORFs in Mycoplasma pneumoniae (1). Since then, a proteomics-based approach to gene annotation model refinement has been successfully applied in both model and non-model organisms (37– 43). Proteogenomic gene annotation is most often carried out by searching peptide mass spectra against a six-frame translation of an associated reference genome sequence database. Peptides identified by this search are then mapped to the existing gene annotation model (Fig. 1). The detection and identification of these peptides provides direct and valuable evidence of protein translation, and have been used to train algorithms for gene model prediction (43).

A crucial prerequisite for genome refinement using MS-based proteomics is sufficient proteome coverage, which became feasible following major improvements in MS instrumentation (ion traps) as well as sample preparation protocols (multi-dimensional fractionation at the peptide level). In fact, Liquid Chromatography (LC)-MS/MS based proteomics utilizing sensitive and fast scanning ion-trap mass analyzers dominated the field of proteogenomics for several years despite the low resolution and mass accuracy of the acquired spectra (37, 38, 40, 44–47). However, the low mass accuracy data acquired by these instruments required large mass tolerances when searching six-frame translation databases, resulting in prohibitively large search spaces, long search times and high proportions of false positive peptide identifications using conventional target-decoy FDR thresholding (48).

8

The invention of a new generation of mass spectrometers revolutionized the field of proteomics, providing high-resolution and high-­‐accuracy MS data at an expanded dynamic range (49, 50). High mass accuracy data has the intrinsic capability to reduce the database search space by allowing for small mass tolerances and an associated decrease in plausible candidate sequences, something of particular importance when searching large genomic six-frame translation databases (31). Further improvements in MS technology enabled the acquisition of high-resolution data at both the MS and MS/MS level without compromising sequencing speed (51–53).

Despite these technological improvements, proteomics still suffers from low sequence coverage even in simple prokaryotic genomes (42). A typical LC-MS/MS proteomics strategy employs the enzyme trypsin to digest proteins by specific cleavage after lysine (K) and arginine (R) amino acids. Any sequence overlap in the resulting pool of peptides will be an artifact of cleavage sites missed by trypsin. Regions of a protein with spacing of K, R residues that are 30 AA’s will tend not to be observed by the mass spectrometer, and peptides with extremes of hydrophilicity or hydrophobicity will not be readily bound or eluted from the LC column, thereby limiting sequence coverage. Recent CPTAC studies report median protein sequence coverage of about 25% across >12,000 proteins in a human cancer cohorts (12) using extensive sample fractionation. Sequence coverage can be increased by use of multiple proteases generating different peptide species at the expense of additional experimental costs. Nucleotide sequencing-based methods to measure gene expression, such as RNA-Seq and Ribo-Seq, typically achieve higher genome coverage and are routinely used to complement the annotation of newly sequenced genomes (54–57).

1.2 Personalized protein sequence databases 9

Reference protein sequence databases, such as the ones from Ensembl or RefSeq, are typically used to identify mass spectra through peptide spectrum matching.

Since these

databases lack sample specific sequence variation, including single amino acid variants (SAAVs), insertions, deletions, alternative splice junctions and novel gene fusions, studies using this approach are unable to identify the corresponding variant peptides present in the MS/MS data. This is a particularly important limitation to consider in cancer studies, where patients acquire tumor specific somatic variation. Analyzing non-synonymous somatic mutations at the proteome level has the potential to yield novel insights into tumor biology (58). To do this, genome and RNA sequencing have been used to generate personalized protein sequence databases by incorporating non-synonymous variants into reference protein sequences. Several informatics pipelines have emerged in recent years for generating these databases (59–66). Fig. 1 illustrates the core processes these pipelines perform, and Table 1 provides a list of typical, currently available software.

Since the likelihood of experimentally observing a peptide decreases as one progresses from a reference database to each variant database (SAAV; novel splice junctions; and novel coding loci in putative intergenic regions), MS/MS datasets are sensibly searched in an iterative fashion through individual databases with separate FDR estimations for PSMs (12, 13, 62, 67). As the total size of the protein database increases, identifying high confidence PSMs requires increased spectral quality with increasingly complete peptide fragmentation. Detailed statistical considerations for iterative search strategy design and FDR estimation in a proteogenomic paradigm have recently been thoroughly described (22).

1.3 Sequencing of antibodies In addition to its role in novel peptide identification and gene annotation, sequence-centric proteogenomics has played a significant role in antibody sequencing (68–71). In vertebrates,

10

antibodies provide the ability for an organism to differentiate between itself and the environment, and a mechanism to fight a diverse range of infections. In order to meet these needs, antibodies possess a tremendous level of sequence diversity, which is achieved through a combination of antibody sequence options and the introduction of mutations. Three types of antibody genes are encoded in the genome: Variable (V), Diversity (D) and Joining (J), and these are combined through a process called V(D)J recombination (72) to produce a diverse library of heavy and light chains that are combined pairwise, and together provide the antigen binding specificity. The affinity maturation process (73) further optimizes antibodies that recognize foreign objects by allowing for a high rate of point mutation introductions, followed by a selection for the strongest binders. At the same time, antibodies recognizing self are eliminated – a process that is defective in autoimmune diseases. Although antibody sequences are encoded in the genome, because of the process by which they are combined and matured, it is not possible to predict the final repertoire of antibody sequences for an individual from the genome alone. However, RNA-Seq can be applied to sequencing the variable region of the light and heavy chains to obtain a sampling of the antibody diversity of an individual.

MS has also been applied to sequencing both recombinant and circulating antibodies. For recombinant antibodies where the sequence is known, MS can be used to confirm the sequence and check for purity. For studying circulating antibodies, the most widely used approach is to use the antigen as bait to enrich for the circulating high affinity antibodies and analyze them with MS/MS. Multiple aliquots are digested with proteases of different specificity to generate comprehensive coverage with overlapping peptides; the spectra can then be interpreted using de novo sequencing approaches (74–77). This requires high quality data, and limits the number of antibodies that can be sequenced in a mixture. An alternative is to use a proteogenomics approach: performing targeted RNA-Seq of the variable region of the light and heavy chains, assembling the reads and translating the assemblies to create a protein sequence database for

11

searching. This approach has been employed to identify high affinity circulating antibodies in infected individuals against HIV (68) and malaria (70) surface proteins that are potentially broadly neutralizing. The approach has also been applied to produce single chain llama antibodies to be used as reagents (71). Llamas and camels produce single chain antibodies in addition to paired heavy-light chain antibodies. The advantage of the single chain antibodies as reagents is that they are small (~15kDa), robust, and once sequenced, they can easily be expressed in E. coli to provide a reproducible resource. These single chain antibodies can be humanized (71) and conjugated with drugs (79); they are, therefore, highly promising candidates for developing therapeutics.

1.4 Viral infections and activation of transposable elements Beyond expression of host genes, viral infections and transposons contribute potential proteinlevel expression in eukaryotic organisms that can be studied using proteogenomic techniques. Viral infections alter gene expression and protein production as the virus highjacks cellular processes that allow for self replication (80–83), and can lead to cell death or cancerous growth of the host cells (84). Eukaryotic genomes also contain mobile elements or transposons that are remnants of ancient viral infections that were incorporated in the host germline genome and then spread throughout the genome through gene copying. It is estimated that about half of the human genome is transposon sequence albeit most are no longer active (85). There is, however, a small subset of the LINE-1 (Long Interspersed Elements) retrotransposons that are capable of autonomous retrotransposition through a copy and paste mechanism using an RNA intermediary, but they are most commonly inactive in somatic cells. However, increased retrotransposition activity has been observed in some disease states including many cancer types (86). The role of retrotransposition in cancer biology is currently unclear and it is not known if it promotes or suppresses tumors, or if it is just an effect of genome instability (87). Also, the mechanism for suppression and activation of retrotransposition is unknown (88) but

12

could provide important insights into cancer biology. Human LINE-1 has two open reading frames: ORF1 and ORF2. ORF1 is an RNA binding protein and ORF2 is an endonuclease and a reverse transcriptase. ORF1 can be quantified using MS-based proteomics, and has been observed by deep MS-based proteomics in many tumors including breast, prostate and ovarian tumors (89). Interesting proteogenomics questions for future research include: Do higher levels of LINE-1 transcripts and ORF1 protein concentrations correlate with tumor progression? Which human transcripts and protein levels correlate with ORF1 protein concentration? Does ORF1 protein concentration correlate with a higher number of somatic LINE-1 insertions?

1.5 Metaproteogenomics Metaproteomics (90, 91), or proteomic investigation of multi-organism communities, represents a unique application for proteomic and genomic data integration. Metagenomic studies typically collect genome data only, representing the functional potential of organisms present in a community, whereas metaproteogenomics adds the additional layer of proteomics data to elucidate what cellular functions are being expressed and utilized by community members. Unfortunately, reference genome sequence databases lack many protein sequences in natural consortia samples, as many organisms within the biological sample may not have a sequenced genome. Despite the meteoric rise in the number of sequenced genomes, there still remain large swaths of bacterial diversity that are unrepresented in databases like GenBank and UniProt. A recent paper by the Banfield group (92) highlights numerous phyla which were previously unknown (not just unsequenced). Indeed, they observe a new phyla radiation with dozens of bacterial phyla entirely separate from known taxonomy. Recent predictions for bacterial diversity on the planet approach 1 trillion distinct species (93). Thus, we expect that reference genome sequence databases will be incomplete at the species level for the foreseeable future.

13

Therefore, a driving need for utilizing genomic data in metaproteomics experiments is the inaccuracy of annotating genomes for novel species. Proteomics has frequently been used not only to improve the set of known proteins in a single organism, but as training and/or testing data for improving gene calling algorithms (94). Moreover, protein annotation is less accurate for genes which have not been previously characterized, a common observation for samples in natural (not laboratory) conditions.

Absent algorithmic advances which allow spectra to be identified without an exact sequence match to a database (95–97), the path forward for metaproteomics involves obtaining sequencing data for the biological sample at hand. This can be either metagenomic or metatranscriptomic data. Each provides crucial information about the potential protein sequences that should be considered when searching tandem mass spectra. Although this introduces additional costs to the project, it has so far been the best way to improve the number of identified peptides in a metaproteomic analysis (98).

2. Analysis of proteogenomic relationships In this section we cover the integration of profiling data from different omics platforms to help elucidate information flow from DNA to RNA to proteins and, most importantly, to phenotype. This includes studies aimed at understanding whether protein abundance can be reliably predicted from mRNA measurements (Section 2.1); assessing the genetic control of mRNA on protein abundance (Section 2.2); and the impact of genetic aberrations on post-translational protein modification (PTM) and signaling (Section 2.3).

2.1 mRNA-protein correlation

14

Correlation between mRNA and protein profiling data has been a topic of considerable research during the past decade, and an excellent review on this has recently been published (2). Early studies focused on the correlation between steady state mRNA and protein abundance for all genes in a single sample, and it was noted in various organisms that relative abundance of proteins in a sample cannot be adequately explained by the corresponding mRNA abundance (3). This can be explained, at least in part, by our understanding that protein abundance is determined by a combination of mRNA abundance, translational regulation, and protein degradation (3). With the availability of paired mRNA and protein data for large sample cohorts, studies on gene-wise correlations between mRNA and protein abundance across many samples also reported modest correlations (5, 8, 99). More recently, the CPTAC consortium has explored mRNA-protein correlations in breast, colorectal, and ovarian cancer samples (10, 12, 13) finding predominantly positive mRNA-protein correlations for all genes with moderate (0.30.45) median correlations (Fig. 2A).

Because both mRNA and protein profiling data are noisy (100, 101), it is unclear how much of the reported low correlation between mRNA and protein expression is due to technological issues versus underlying biology. Statistical methods that attempt to model stochastic and systematic errors in mRNA and protein profiling data have produced higher mRNA-protein correlations (7, 102), and thus it has been suggested that transcription predominantly determines protein abundance (103). Recent studies by the CPTAC consortium have reported non-random associations between the level of mRNA-protein correlation and biological functions of the genes (10, 12, 13). For example, metabolic functions such as amino acid, fatty acid and nucleotide metabolism are enriched for genes with high mRNA-protein correlations, whereas ribosomal and mRNA splicing functions are enriched for genes with low or negative mRNA-protein correlations. A more systematic study using mRNA and protein profiling data from the three CPTAC cancer types showed that proteomic data strengthened the link between

15

gene expression and function for at least 75% of Gene Ontology (GO) biological processes and 90% of KEGG pathways (104). Thus, mRNA-protein discrepancy cannot be simply explained by experimental errors, and biological functions arise from both mRNA- and protein-level regulations.

2.2 Genetic control of mRNA and protein abundance Genetic variation plays an important role in determining mRNA and protein abundance. mRNA and protein expression data from a cohort of samples can be integrated with DNA variation information to study the underlying genetic determinants of gene expression variation. This type of analysis is an extension of the traditional quantitative trait locus (QTL) mapping, in which a section of DNA (the locus) is correlated with variation in a phenotype (i.e., quantitative trait). When expression levels of mRNAs are treated as quantitative traits, the QTL analysis is named eQTL analysis, a method that has become well-established in the field of genetics (105) (Fig. 2B). eQTLs may be cis- or trans- acting, determined by their physical distance from the gene they regulate. Specifically, cis-eQTLs affect gene expression at the same locus of the genotype, whereas trans-eQTLs affect gene expression at a different locus. Although many cis-eQTLs have been reported, mapping trans-eQTLs has been less successful (106). It remains unclear whether the difficulty in mapping trans-eQTLs reflects true biology (i.e., eQTLs primarily act in cis) or computational and statistical challenges. More recently, ribosome occupancy and protein abundance have been used as quantitative traits to identify ribosome occupancy QTLs (rQTLs) and protein abundance QTLs (pQTLs), respectively (8).

An integrative multi-omics study on a set of HapMap Yoruba lymphoblastoid cell lines found that most QTLs were associated with mRNA expression levels, but their impact on protein expression levels were significantly reduced (4). This buffering of protein levels may allow cells to cope with noisy genetic variations and attenuate their impact on downstream phenotypes.

16

Interestingly, a set of cis QTLs that affect protein abundance showed little or no effect on messenger RNA or ribosome levels, suggesting their potential roles in posttranslational regulation. Both the buffering effect and protein abundance specific QTLs have been reported in earlier studies in yeast (5, 6), Arabidopsis (7), mouse (8), and human (9). These studies all suggest that integrating high-throughput proteomic data into QTL analysis could provide new insights into gene expression regulation.

Similarly, analysis of the correlation between copy number alteration (CNA) and mRNA or protein abundance has been used to infer the impact of CNAs on mRNA and protein abundance, including both cis-effects on the abundance of genes in the same loci and transeffects on the abundance of genes at other loci in the genome. Visualization of the resulting correlation matrix in a heatmap can help highlight statistically significant cis- and transcorrelations. Furthermore, visually and statistically comparing the correlation heatmaps for mRNA and protein can reveal relationships between these profiles: cis- and trans-effects in protein (and also phosphoprotein) are generally subsets of mRNA cis- and trans-effects respectively, with more directionally uniform effects at the protein level (10, 12, 13) (Fig. 2B). These correlation matrices can also be used to identify candidate driver genes whose copy number alterations directly drive significant trans-effects by comparing with functional knockdown data in large public databases like LINCS (Library of Integrated Network-based Cellular Signatures) (12).

Efforts have also been made to study the roles of miRNAs in gene expression regulation. miRNAs are small non-coding RNAs that pair to the messenger RNAs (mRNAs) of proteincoding genes to suppress their expression (107). Several studies have shown that in addition to down-regulating mRNA levels, miRNAs also directly repress translation of hundreds of genes (108–110). To investigate all miRNAs simultaneously in their endogenous context, Liu et al.

17

performed an integrative analysis of global miRNA, mRNA, and protein profiles in nine colorectal cancer cell lines using a correlation-based method (11) (Fig. 2B). This study showed that translational repression was involved in more than half, and played a major role in a third of all predicted miRNA-target interactions. These predicted miRNA-target interactions can be further confirmed by more focused miRNA perturbation studies. Interestingly, sequence features known to drive site efficacy in mRNA decay, such as 8mer seed site, site positioning within 3’ UTR, local AU-rich context, and additional 3’ pairing, are generally not applicable to translational repression (11). A key unanswered question is what sequence features determines selectivity for miRNA-mediated translational repression.

2.3 Relating mutations to post-translational modifications and signaling Millions of non-synonymous single nucleotide polymorphisms (nsSNPs) identified by nextgeneration sequencing (NGS) and genome-wide association (GWAS) studies have been correlated with certain phenotypes and diseases (111, 112). However, the functional mechanisms of these associations are often barely understood or completely unknown. One likely explanation is that a subset of these SNPs result in amino acid changes in PTM targets, including targets of phosphorylation (specific to serines, threonines and tyrosines), or acetylation and ubiquitylation of lysines, directly perturbing cell signaling networks (14, 113– 116). Since these four amino acids account for 22.2% of all amino acids in the human proteome (117), they are expected to be disproportionately affected by missense mutations. Substitutions of amino acids that are targets of PTMs can result in destruction, genesis or constitutive activation of PTM sites (114). Moreover, mutations affecting proximal flanking positions of PTM sites might alter the recognition motif for corresponding transferases, e.g., protein kinases recognize, besides other factors, specific motifs on its substrate protein (14, 15).

18

To address this, several studies have assessed the affect of SNP-induced changes to PTM sites, predominantly serine, threonine and tyrosine phosphorylation. In 2008 Ryu et al. (14) used data from Swiss-Prot and Swiss-variant (117) databases to develop software predicting phosphorylation sites accompanied by a database for human phosphovariants, which the authors defined as genetic variations that change phosphorylation sites or their interacting kinases. In this study variants were classified into three groups depending on whether the variant directly affects a phosphorylation site, the flanking region or the kinase itself (Fig. 2C). Two years later, Yang et al. (116) used phosphosites annotated in Phosho.ELM (118), the Human Protein Reference Database (HPRD) (119) and SwissProt (117) together with SNPs annotated in the NCBI dbSNP database (120), to identify 64 phosphorylation sites that potentially result in a disease phenotype, including schizophrenia and hypertension, when substituted by a non-phosphorylatable amino acid. In total 1,451 nsSNPs which were present in dbSNP (downloaded May 2007) occurred in a +/- 7 amino acid flanking region of a phosphosite, thereby potentially influencing the recognition of a kinase towards its preferred substrates. In a related study, Ren et al. (15) carried out a genome-wide analysis of SNPs that potentially influenced protein phosphorylation status. The authors used a combination of dbSNP predicted kinase-specific phosphosites and experimentally detected phosphosites to identify and classify SNPs affecting phosphosignaling. Based on the predicted phosphosites the authors estimated that ~70% of nsSNPs have the potential to affect phosphosignaling, suggesting that a large portion of nsSNPs play an important role in rewiring biological pathways. Creixell et al. (16) described a similar computational approach (ReKINect) to systematically classify and interpret such network-attacking mutations (NAMs) specifically in phosphosignaling. The authors used exome sequencing, bioinformatics and phosphoproteomics to demonstrate as a proof-ofprinciple the existence of six types of NAMs in human cancer cell lines.

19

Additionally, Reimand et al. (121) developed a computational method (ActiveDriver) based on a gene-centric generalized linear regression model to detect aberrant mutation rates proximal to phosphorylation sites. This method was used to analyze 800 genomes spanning eight cancer types to detect mutations specifically targeting the phosphorylation machinery, identifying 44 genes with significantly higher mutation rates in regions with detected phosphosites compared to the gene sequence given its structured and disordered regions. Mutations identified were comprised of both known driver mutations and novel candidates. The authors then extended their approach to the TCGA pan cancer dataset, containing more than 3,000 genomes from 12 cancer types and found mutations affecting phosphosignaling in about 90% of all tumors (17). Two databases that are valuable for PTM proteogenomic analysis that have not yet been described are PhosphoSitePlus (PSP) (122) and g2pDB (115). PSP is a manually curated database of mammalian PTM sites containing over 330,000 non-redundant PTMs (122). In its latest release (2014), PSP introduced the ‘PTMVar’ dataset which intersects missense mutations and PTMs detailing 25,000 PTMs impacted by known variants, about 75% of which relate to phosphorylation. The remaining PTM sites comprise ubiquitylation, acetylation, monomethylation and succinylation sites. These additional modifications, despite their low coverage, enable researchers to interrogate genomic mutations with PTMs beyond phosphorylation. g2pDB (115) is a database mapping protein PTMs to genomic coordinates for all phosphorylations, acetylations and ubiquitinylations that are available in the Global Proteome Machine Database (GPMDB) (123, 124). Overlaying the genome-mapped PTM sites with genome coordinates of known, disease-associated SNPs might reveal a role of these PTM sites in the respective disease. A list of all relevant tools and databases can be found in Table 2.

All of the aforementioned studies focused on the classification of mutations as either directly or indirectly (i.e. in a proximal flanking region) affecting the PTM site. However, there is no consensus on the length of flanking regions, number of classes, and their nomenclature, making

20

it difficult to directly compare the findings of different studies. The integrative analysis of genomic mutations and PTM-mediated signaling shows great promise in providing insights into the mode of action of disease-associated mutations. This type of analysis can aid in discriminating tumor driver mutations from functionally neutral passenger mutations, and ultimately lead to novel personalized treatments. More importantly, the analysis of PTMs is not accessible to genomics sequencing technologies and the ever-increasing collection of published, global PTM-omes at single amino acid resolution demonstrates the indispensable value of state-of-the art MS-based proteomics in the era of precision medicine.

3. Integrative modeling of proteogenomic data Integrative modeling involves the application of statistical, machine learning and networkmodeling tools to data obtained from one or more omics platforms. In this section we focus on the application of integrative modeling to proteogenomic analyses. Models can be developed on combined omics datasets (e.g. genomics and proteomics), or applied to each omics data type separately, and the results comparatively analyzed. We review clustering (Section 3.1) and predictive modeling (Section 3.2)—usually termed class discovery and class prediction, respectively—which are both orthogonal approaches to gaining insight from biological data; and network modeling (Section 3.3), which interprets data in the context of prior biological knowledge and promotes understanding at the level of pathways and cellular mechanisms.

3.1 Unsupervised Clustering Clustering is a method of grouping similar entities—e.g., samples, genes, proteins, etc.— together based on a similarity metric. Since meta-data about the entities—like phenotypes, mutations, disease type, etc.—are not used in the clustering process, the algorithms are termed “unsupervised”, and are primarily used to discover new groups or classes, in addition to

21

computationally validating known biology. Most proteogenomic analysis includes unsupervised clustering of proteome and/or phosphoproteome data, followed by comparison of the resulting clusters to known subgroups, cluster labels derived from genomic data, or other mutation, survival or clinical data.

Clustering of proteome data is performed using a variety of algorithms including hierarchical (10), k-means (12) and model-based clustering (13). While the clustering algorithms can vary, consensus clustering (125) is a common approach used to assess cluster stability and define the natural number of clusters in the data. Visualization of the consensus matrix, along with the delta-area plot and silhouette plots (126) are an effective way to determine the number of clusters in the data.

Once the proteome or phosphoproteome clusters are identified, the samples constituting these clusters can be characterized by enrichment tests for known subgroups (e.g., PAM-50 classification or RPPA groups in breast cancer, methylation subtype in colon cancer, mutation status for relevant genes, or other clinical or survival data). In addition, supervised marker selection methods combined with pathway enrichment analysis (e.g., SAM (127) marker selection followed by Gene Set Enrichment Analysis (GSEA) (128) for pathway enrichment) can also be used to characterize proteome or phosphoproteome clusters by identifying pathways or gene sets that are selectively up or down regulated in each cluster.

An alternative approach to clustering proteome or phosphoproteome data is to project the original data to pathway space and then cluster the projected data. This approach is used by Mertins et. al (12) to cluster phosphoproteome pathways, resulting in a unique cluster not directly observed either in the proteome or phosphoproteome data. The projection to pathway space is performed using single-sample gene set enrichment analysis (ssGSEA) (129), where

22

the

enrichment

of

curated

pathways

(MSigDB

C2

gene

sets,

http://software.broadinstitute.org/gsea/msigdb) in each sample is evaluated. The enrichment scores are then subject to unsupervised clustering, followed by characterization of the derived clusters using the pathways constituting the dataset.

Co-clustering In co-clustering, data from multiple modalities (e.g., mRNA and proteome) are treated as independent “samples”, and clustering is performed over the collection of disparate omics profiles. The key here is to either transform the data so that different modalities are comparable (e.g., z-scores), or to use a similarity metric that is agnostic to the scale of values in the data (e.g., Spearman correlation (130)). Co-clustering mRNA and proteome data (using hierarchical clustering) after filtering to retain genes or proteins with moderate to high correlation is used to show that mRNA profiles of samples are closest to their corresponding proteome profiles, thereby validating sample quality and mitigating concerns regarding tumor heterogeneity in (12) (Fig. 3A).

Multi-omic Clustering Clustering over sample profiles obtained from two or more omic platforms is referred to as multiomic clustering. Unlike co-clustering, where the multi-omic data from each sample provides independent items that are clustered simultaneously, multi-omic clustering attempts to derive an integrative clustering to assign each sample to a single cluster based on combined evidence from multi-omic data. An overall review of multi-omic clustering methods is presented in (131) where algorithms are grouped by strategy:

23



Direct integrative clustering methods use a combined multi-omics dataset as input to the clustering analysis. Examples in this category include iCluster+ (132), LRAcluster (133) and moCluster (134).



Clustering of clusters is an approach where clustering is initially performed on each omics dataset and the results integrated into final cluster assignments. Examples include COCA (135) and SNF (136).



Regulatory integrative clustering harnesses molecular regulatory structures and/or networks to integrate different omics datasets in a robust manner. Examples in this group include PARADIGM (137) and iRafNet (138).

Many of the clustering algorithms use de-novo or regulatory network graphs to model interactions in each omics domain, and to drive integration across different omics datasets. A review of clustering methods from this orthogonal perspective is covered in (139).

3.2 Predictive modeling Predictive modeling is a statistical approach in which models are built to predict a future outcome based on data attributes.

Machine learning, pattern recognition and predictive

analytics all lie within the umbrella of predictive modeling and this method of analysis has been rapidly gaining traction across most scientific disciplines. Predictive modeling and machine learning techniques applied to proteogenomics can greatly improve our ability to accurately diagnose, guide prognosis, and treat disease. For example, global molecular profiling of tissues and tumors enables a shift from non-specific treatment strategies towards a more targeted, personalized approach based on the presence or absence of predictive genetic and/or protein signatures. Typical supervised classification methods used for predictive modeling from omics data include Support Vector Machines (SVMs) (140), Bayesian logistic regression (141) and random forests (142). 24

Machine Learning To date, machine learning and statistical modeling techniques applied to genomics and transcriptomics data have identified genetic profiles predictive of disease diagnoses (143) and drug response (18, 144–146). One would expect the predictive analysis of proteome and phosphoproteome data to be more informative regarding clinical outcomes compared to NGS data, as these data modalities are more proximal to the disease. These techniques have been applied to proteomics data to classify clinically-relevant disease subtypes in cancer (147–149), to define prognosis (150), and to identify biomarkers predicting drug sensitivity (150–152).

Despite the use of predictive modeling in genomics and proteomics independently, studies integrating proteomics and genomics are less common. Several studies using “multi-modal” integration of data types including RNA-Seq, exon expression, and Reverse Phase Protein Array (RPPA) data to predict clinical phenotypes and drug response found no advantage to combining data modalities compared to individual platform analysis and showed gene expression data to be consistently more predictive than RPPA-based proteomics (19, 144). Similarly, Ma et al. (20) found that in machine learning models predicting ten-year survival from 77 breast tumors (12), fusion of four data types (genome, transcriptome, MS/MS-based proteome and phosphoproteome) did not improve the predictive performance of the model. However, they did find proteomics to outperform models based on genomics and transcriptomics data in survival prediction (20). As this is still fairly uncharted territory in proteogenomics, we anticipate to see a wealth of studies focused on assessing the predictive power of proteomics, and phosphoproteomics in disease prognosis, diagnosis and drug response in the future (Fig. 3B).

Supervised Analysis for Marker Selection

25

Aside from machine learning, supervised analysis has been used to derive markers for a variety of distinctions including intrinsic disease subtypes (e.g., PAM-50 subtype in breast cancer or HRD status in ovarian cancer), subtypes identified by clustering, samples with and without mutations in genes of interest (e.g., PIK3CA or TP53 mutations) and survival analysis. For examples, see (12) and (13).

Common marker selection methods used include the t-test, ANOVA (F-test), moderated tests (153) and SAM (127), in addition to non-parametric tests like the Mann-Whitney test and the Kruskal-Wallis test. While these tests are in most cases applied to specific types of omics data, marker ranking from these tests can be combined across multiple omics datasets to derive a global overall rank using rank aggregation algorithms (154, 155).

3.3 Pathway and network modeling Historically, the field of biomedicine has operated under the “molecular biology paradigm”, in which it is assumed that biological function can be explained through the comprehensive knowledge of genes and their associated proteins, and that these proteins operate in linear pathways (156). Despite large-scale efforts to link genotype and phenotype under this paradigm, the relationships between the two are still wholly unresolved and surprisingly complex. Instead, the systems or network biology approach attempts to take into account these complex relationships to better understand this genotype-phenotype connection (157–159). Studies in proteogenomics can build upon current models of network biology, contributing to both network annotation and using established pathway and gene ontology tools in gene-protein enrichment analyses.

Network Annotation

26

In network biology, nodes represent the molecules of interest (gene, protein, metabolite) and edges represent a function, physical or enzymatic relationship. Genetic and physical interaction networks are commonly used models for studying complex systems and disease. These networks can reflect a static system, built from information in a single condition or a differential system, highlighting changes in network connections in two distinct states, revealing statespecific and disease-specific interactions (see reviews (160, 161)).

Biological networks are typically built in three ways: (1) curation of available physical or biochemical interaction data, (2) computational predictions based on sequence similarity, gene co-occurrence, or gene co-expression; and (3) comprehensive assessment of whole genomes or proteomes (158). MS-based proteomics and PTM-omics can be layered atop these scaffold networks to both fine tune the biological network representation and identify network rewiring in a disease state (162) (Fig. 3C). For example, Zhang et al. identified protein-protein interaction network modules that were enriched in down-regulated proteins in a poor-prognosis colorectal cancer subtype (10). Similarly, analysis of gene-protein co-expression found differential interaction patterns in a subset of network modules in basal-enriched and luminal-enriched breast cancer subgroups (12).

Pathway and Gene Ontology (GO) Enrichment Several approaches for pathway and GO enrichment analysis have been developed, including over-representation and Gene Set Enrichment Analysis (GSEA). Over-representation analysis uses the Fisher’s exact test to identify pathways and GO terms with significant overrepresentation in a gene or protein list of interest, which should be predefined based on differential expression, clustering, or other upstream analyses. Representative tools in this category include DAVID (163) and WebGestalt (164) (Table 3). GSEA (128) ranks genes or proteins in the entire dataset based on differential expression or association to a continuous

27

phenotype, and then uses a modified version of the Kolmogorov-Smirnov test (165) to identify pathways, signatures and GO terms in which the gene members are enriched at the top or bottom of the ranked list (Fig. 3C). As both the over-representation method and the GSEA approach ignore pathway topology when performing enrichment analysis, an additional tool, SPIA (166), was established to address this limitation. Further, because these methods all perform enrichment analysis at the gene level, they do not allow for phosphosite-level enrichment analysis, which is critical for understanding kinase-phosphate signal transduction in phosphoproteome profiling studies. PHOXTRACK (167) was developed for this purpose, and modifies the GSEA approach to search for an enrichment of known kinase targets in an uploaded phosphoproteomics profile dataset (Table 3).

4. Data sharing and visualization In this section we describe useful frameworks for organizing, sharing, and visualizing multiomics data in the context of the genome (Section 4.1) or protein interactome (Section 4.2). Efficient data sharing and visualization plays a central role in making complicated proteogenomic data directly available and useful to the broad biological community. One successful example is the cBioPortal (168), which allows easy retrieval and visualization of multi-omics data from many studies for user-selected genes or pathways. The simple user query interface of cBioPortal hides data complexity from the users, and is self-explanatory and well suited for answering focused questions. Herein we will describe additional tools relating to data sharing and visualization allowing for both exploratory analyses and focused queries of proteogenomics data.

4.1 Genome-based data sharing and visualization

28

The genomic sequence provides a natural platform for the dissemination and visualization of genome-anchored information and as such, genome browsers, such as the UCSC genome browser and the Ensembl Genome Browser (169), have long been used to present sequence alignment and gene annotation information. Building upon these, tools such as the Integrative Genomics Viewer (IGV) (170) have been developed to extend the capacity of genome browsers to allow integrative visualization of genomic and transcriptomic sequencing data in the context of the genomic sequence. The addition of proteomic data into genome browsers enables covisualization with gene annotation information as well as genomic and transcriptomic abundance and coverage information, and thus facilitates proteogenomic data integration. A major obstacle lies in mapping peptides to their genomic locations, and many tools have been developed for both genome-centric mapping and visualization. The mapping process is either based on genome sequence alone or the composite of the genome and existing annotation. An early study by Kalume et. al. used the TBLASTN algorithm to determine the corresponding regions in the genome for each peptide (171) which could then be visualized in the Ensembl Genome Browser. Annotation arising from this analysis was shared with the community using The Distributed Annotation System provided by Ensembl (171). Similar tools including Pepline (172) and The Proteogenomic Mapping Pipeline (173) use string searching algorithms to map peptides against a given genome translated in all six reading frames. However, these tools were not designed for visualization purposes, creating output files incompatible with genome browser visualization tools.

Other tools take advantage of extra gene annotation information to speed up this mapping process and facilitate visualization of peptides in genome browsers (UCSC, Ensembl, Integrative Genomics Viewer (IGV)), including PeptideAtlas (174), iPiG (175), PG Nexus (176), proBAMsuite (177) and PGx (178), among others (179–181). PG-Nexus (176) adopts a simplified SAM file so that proteomic data can be visualized in alignment with genome and

29

transcriptome. The recently introduced proBAM format stores PSMs within the context of the genome. The proteomic identifications from three published proteomics datasets, including colorectal cancer samples from CPTAC (13) and cancer cell line samples from two different sources (182, 183), have been converted to proBAM format and made available in a JBrowsebased genome browser (http://proteogenomics.zhang-lab.org/), allowing for dissemination of proteomics data to a general audience beyond the proteomics community (177). Fig. 4 shows an example of proBAM file visualization using data from colorectal cancer cell lines (182). PGx (178) maps identified peptides onto their putative genomic coordinates based on a gene model and outputs a Browser Extensible Data (BED) and BEDgraph file for both peptide location and relative quantitation visualization, and has been applied to mapping and visualization of CPTAC data in the UCSC Genome Browser (http://fenyolab.org/ucsc_cptac_v1). Beyond the tools mentioned above, additional tools like VESPA (184) provide a stand-alone visualization solution. Tools connecting the peptide identifications to their underlying evidence, and providing easily accessible browser-based interactive visualization of large sets of MS data, are also being developed (185). With the increasing availability of genomic and transcriptomic data from highthroughput sequencing technologies, putative gene structure could be inferred for un-annotated genomes. These approaches are expected to promote the development of additional mapping applications.

Despite the potential of genome browser-based visualization, these tools typically require specific input formats such as Sequence Alignment/Map (SAM) and BED, most of which were designed for genomic but not proteomic data. A standard file format including both proteomicsspecific

information

and

genome

mapping

coordinates

would

be

useful

for

future

proteogenomics studies. To facilitate the visualization and sharing of the rapidly growing body of proteogenomics data, the Human Proteome Organization (HUPO) Proteomics Standards Initiative

(PSI)

is

working

on

two

new

proteogenomics

data

format,

proBAM

30

(http://www.psidev.info/proBAM)

and

proBED

(http://www.psidev.info/probed).

The

basic

concepts of the proBAM format are: 1) to hold more proteomics information; 2) provide functionality highly analogous to BAM, which connects proteomics data to a large number of bioinformatics tools that have been developed for BAM files; 3) to serve as a well-defined interface between PSM identification and downstream analyses. proBAM takes advantage of existing data formats and imports additional information from the peptide identification process to the genome browser visualization. As a highly compact file format, proBAM also facilitates public sharing and re-use of MS-based proteomics data. Similarly, proBED adopts the BED format and can be loaded directly to the UCSC genome browser as an annotation track. The proBAM format contains more detail and structure than proBED and thus it is expected that a proBAM to proBED conversion should be possible under most circumstances.

4.2 Network-based data sharing and visualization Network-based data visualization plays a central role in biological knowledge discovery and a wide range of tools have been developed to visualize and analyze omics data in the context of biological networks (186). Most tools visualize a network as a node-link diagram in which nodes represent molecular components and edges represent relationships between the components. Typically, four types of data can be visualized in a node-link diagram based on various approaches that are available in different tools.

(I) Single binary data, such as functional annotation data for genes or significant calls from statistical analysis. A single set of binary data can be visualized in a network by changing one of the node attributes, including node color, node border color, node label color, node size, node label size and node shape. Tools such as Cytoscape (187) and VisANT (188) support this function.

31

(II) Single continuous data, such as fold changes or p-values for all genes in an omics study. For this type of data, a color gradient or size gradient can be used to color or size the nodes or node labels in the network according to the scores. This function is also available in Cytoscape and VisANT.

(III) Composite binary data, such as mutation status for genes in multiple samples. The approach described in type I can map the measurements from each sample to the network but only one sample at a time (185, 188). To visualize the measurements from more than one sample, Cerebral (190) developed the ‘small multiples’ method that can view the data in parallel by creating multiple versions of the same network in a grid, and each version represents a different sample. Some other tools, e.g., VisANT (188) and VistaClara (191), use animation-like visualization to update node attributes sequentially in the network.

(IV) Composite continuous data, such as gene expression data from multiple conditions. Beside animation and small multiples, some other tools have been developed to visualize the complete dataset simultaneously. For example, GENeVis (192), VisANT and VANTED (193) can replace each node in the network by a bar plot in which each bar represents a measurement from one condition, while VistaClara adds the bar plot below the node to visualize the data.

The above tools are very efficient for small networks. However, when the number of nodes goes beyond a few hundred, it is difficult to show the details of the embedded visualizations. Furthermore, simultaneous visualization of multi-omics data from multiple experimental conditions is almost impossible. The challenge cannot be addressed by developing graph layout algorithms (194) or extending the two-dimensional representation of the network into threedimension (195).

32

To address this challenge, Shi et al. developed NetGestalt (196), a web-based tool that exploits the inherent hierarchical modular architecture of a biological network to create a one dimensional layout of the network nodes (e.g., genes or proteins), with network-module information displayed below the linearly ordered nodes. Notably, this transformation makes it possible to visualize the above four types of data below the linearly ordered nodes by barcode plots, bar plots and heat maps. Thus, diverse experimental data such as DNA methylation, gene mutation, copy number variation, gene expression, and protein expression and modification, can be simultaneously visualized in the context of the network. In addition, annotations from Gene Ontology, pathway databases, and other resources can also be visualized concurrently to facilitate the interpretation of experimental data. Fig. 5 shows a case study using NetGestalt to identify subtype specific modules from the human protein-protein interaction network (197) based on the CPTAC colorectal proteomics data (10).

Although network-based data visualization has advanced rapidly in the last decade, communicating network-based findings between scientists remains a big challenge. Web-based visualization tools such as CellMaps (198) provide a possible platform for sharing visualization results online for remote access and editing. However, sharing is not currently available in web applications allowing network-based data visualization. NDEx (the network data exchange) (199) allows scientists to create communities and to share networks but does not support data visualization. A network-based data visualization tool with the capability to share results would be particularly useful in helping scientists identify, record and communicate their findings.

5. Conclusion Rapid technological developments occurring over the last decade have made it possible to generate large proteogenomic data sets, thereby driving the development of new methods for

33

proteogenomic data analysis. It is now quite common to see genomics data informing proteomics data analysis or vice versa. Further, significant progress has been made in our understanding of the complex relationships between DNA, RNA and protein, and these developments are being used to both understand basic biological process, and improve molecular markers and drug target discovery. There are, however, still many challenges in proteogenomics including the limited coverage and dynamic range of MS-based proteomics (200), and the difficulty of generating proteogenomic data sets with large numbers of samples due to cost and sample accessibility. This mismatched combination of limited sample numbers and measurements with large variability remains challenging for most analysis methods and can lead to ungeneralizable findings. The large number of quantitative measurements across multiple data modalities also poses a serious challenge for visualizing the data in a meaningful way. For example, visualizing large-scale gene expression analysis alongside proteomics and phosphoproteomics data requires that we present gene-level quantitation while also maintaining information on protein isoforms and expression differences across multiple phosphosites. Moreover, our knowledge of basic biological processes and pathways is still limited, clouding our ability to explain many of the proteogenomic relationships we identify. Despite these remaining challenges, we foresee continued and rapid development in this field both from a data generation perspective—with deeper, faster and cheaper genomic and proteomic profiling technologies—and a data analysis perspective, with the continued development of more sophisticated computational tools for data integration, modeling and visualization.

Acknowledgements KVR, SHP and DF were funded by National Cancer Institute (NCI) CPTAC award U24 CA210972. KVR and DF were funded by contract 13XS068 from Leidos Biomedical Research, Inc. SHP was funded by an Early Career Award from the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research. XW, JW, and BZ were funded by

34

National Cancer Institute (NCI) CPTAC awards U24CA159988 and U24CA210954 and by contract 13XS029 from Leidos Biomedical Research, Inc. BZ is Cancer Prevention & Research Institutes of Texas (CPRIT RR160027) Scholar and McNair Medical Institute Scholar. KK, KRC and DRM were funded by National Cancer Institute (NCI) CPTAC awards U24CA160034, U24CA210986 and U24CA210979.

35

References 1. Jaffe, J. D., Berg, H. C., and Church, G. M. (2004) Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 4, 59–77 2. Liu, Y., Beyer, A., and Aebersold, R. (2016) On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell. 165, 535–550 3. Vogel, C., and Marcotte, E. M. (2012) Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 10.1038/nrg3185 4. Battle, A., Khan, Z., Wang, S. H., Mitrano, A., Ford, M. J., Pritchard, J. K., and Gilad, Y. (2015) Genomic variation. Impact of regulatory variation from RNA to protein. Science. 347, 664–667 5. Foss, E. J., Radulovic, D., Shaffer, S. A., Goodlett, D. R., Kruglyak, L., and Bedalov, A. (2011) Genetic variation shapes protein networks mainly through non-transcriptional mechanisms. PLoS Biol. 9, e1001144 6. Foss, E. J., Radulovic, D., Shaffer, S. A., Ruderfer, D. M., Bedalov, A., Goodlett, D. R., and Kruglyak, L. (2007) Genetic basis of proteome variation in yeast. Nat. Genet. 39, 1369– 1375 7. Fu, J., Keurentjes, J. J. B., Bouwmeester, H., America, T., Verstappen, F. W. A., Ward, J. L., Beale, M. H., de Vos, R. C. H., Dijkstra, M., Scheltema, R. A., Johannes, F., Koornneef, M., Vreugdenhil, D., Breitling, R., and Jansen, R. C. (2009) System-wide molecular evidence for phenotypic buffering in Arabidopsis. Nat. Genet. 41, 166–167 8. Ghazalpour, A., Bennett, B., Petyuk, V. A., Orozco, L., Hagopian, R., Mungrue, I. N., Farber, C. R., Sinsheimer, J., Kang, H. M., Furlotte, N., Park, C. C., Wen, P.-Z., Brewer, H., Weitz, K., Camp, D. G., Pan, C., Yordanova, R., Neuhaus, I., Tilford, C., Siemers, N., Gargalovic, P., Eskin, E., Kirchgessner, T., Smith, D. J., Smith, R. D., and Lusis, A. J. (2011) Comparative analysis of proteome and transcriptome variation in mouse. PLoS Genet. 7, e1001393 9. Lappalainen, T., Sammeth, M., Friedländer, M. R., ’t Hoen, P. A. C., Monlong, J., Rivas, M. A., Gonzàlez-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P. G., Barann, M., Wieland, T., Greger, L., van Iterson, M., Almlöf, J., Ribeca, P., Pulyakhina, I., Esser, D., Giger, T., Tikhonov, A., Sultan, M., Bertier, G., MacArthur, D. G., Lek, M., Lizano, E., Buermans, H. P. J., Padioleau, I., Schwarzmayr, T., Karlberg, O., Ongen, H., Kilpinen, H., Beltran, S., Gut, M., Kahlem, K., Amstislavskiy, V., Stegle, O., Pirinen, M., Montgomery, S. B., Donnelly, P., McCarthy, M. I., Flicek, P., Strom, T. M., Geuvadis Consortium, Lehrach, H., Schreiber, S., Sudbrak, R., Carracedo, A., Antonarakis, S. E., Häsler, R., Syvänen, A.-C., van Ommen, G.-J., Brazma, A., Meitinger, T., Rosenstiel, P., Guigó, R., Gut, I. G., Estivill, X., and Dermitzakis, E. T. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 501, 506–511 10. Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z., Chambers, M. C., Zimmerman, L. J., Shaddox, K. F., Kim, S., Davies, S. R., Wang, S., Wang, P., Kinsinger, C. R., Rivers, R. C., Rodriguez, H., Townsend, R. R., Ellis, M. J. C., Carr, S. A., Tabb, D. L., Coffey, R. J., Slebos, R. J. C., Liebler, D. C., and NCI CPTAC (2014) Proteogenomic characterization of human colon and rectal cancer. Nature. 513, 382–387 11. Liu, Q., Halvey, P. J., Shyr, Y., Slebos, R. J. C., Liebler, D. C., and Zhang, B. (2013) Integrative omics analysis reveals the importance and scope of translational repression in microRNA-mediated regulation. Mol. Cell. Proteomics MCP. 12, 1900–1911 12. Mertins, P., Mani, D. R., Ruggles, K. V., Gillette, M. A., Clauser, K. R., Wang, P., Wang, X., Qiao, J. W., Cao, S., Petralia, F., Kawaler, E., Mundt, F., Krug, K., Tu, Z., Lei, J. T., Gatza, M. L., Wilkerson, M., Perou, C. M., Yellapantula, V., Huang, K., Lin, C., McLellan, M. D., Yan, P., Davies, S. R., Townsend, R. R., Skates, S. J., Wang, J., Zhang, B., Kinsinger, C. R., Mesri, M., Rodriguez, H., Ding, L., Paulovich, A. G., Fenyö, D., Ellis, M.

36

13.

14. 15. 16.

17. 18.

19. 20. 21. 22. 23. 24. 25.

J., Carr, S. A., and NCI CPTAC (2016) Proteogenomics connects somatic mutations to signalling in breast cancer. Nature. 534, 55–62 Zhang, H., Liu, T., Zhang, Z., Payne, S. H., Zhang, B., McDermott, J. E., Zhou, J.-Y., Petyuk, V. A., Chen, L., Ray, D., Sun, S., Yang, F., Chen, L., Wang, J., Shah, P., Cha, S. W., Aiyetan, P., Woo, S., Tian, Y., Gritsenko, M. A., Clauss, T. R., Choi, C., Monroe, M. E., Thomas, S., Nie, S., Wu, C., Moore, R. J., Yu, K.-H., Tabb, D. L., Fenyö, D., Bafna, V., Wang, Y., Rodriguez, H., Boja, E. S., Hiltke, T., Rivers, R. C., Sokoll, L., Zhu, H., Shih, I.-M., Cope, L., Pandey, A., Zhang, B., Snyder, M. P., Levine, D. A., Smith, R. D., Chan, D. W., Rodland, K. D., and CPTAC Investigators (2016) Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell. 166, 755–765 Ryu, G.-M., Song, P., Kim, K.-W., Oh, K.-S., Park, K.-J., and Kim, J. H. (2009) Genomewide analysis to predict protein sequence variations that change phosphorylation sites or their corresponding kinases. Nucleic Acids Res. 37, 1297–1307 Ren, J., Jiang, C., Gao, X., Liu, Z., Yuan, Z., Jin, C., Wen, L., Zhang, Z., Xue, Y., and Yao, X. (2010) PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation. Mol. Cell. Proteomics MCP. 9, 623–634 Creixell, P., Schoof, E. M., Simpson, C. D., Longden, J., Miller, C. J., Lou, H. J., Perryman, L., Cox, T. R., Zivanovic, N., Palmeri, A., Wesolowska-Andersen, A., Helmer-Citterich, M., Ferkinghoff-Borg, J., Itamochi, H., Bodenmiller, B., Erler, J. T., Turk, B. E., and Linding, R. (2015) Kinome-wide decoding of network-attacking mutations rewiring cancer signaling. Cell. 163, 202–217 Reimand, J., Wagih, O., and Bader, G. D. (2013) The mutational landscape of phosphorylation signaling in cancer. Sci. Rep. 3, 2651 Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A. A., Kim, S., Wilson, C. J., Lehár, J., Kryukov, G. V., Sonkin, D., Reddy, A., Liu, M., Murray, L., Berger, M. F., Monahan, J. E., Morais, P., Meltzer, J., Korejwa, A., Jané-Valbuena, J., Mapa, F. A., Thibault, J., Bric-Furlong, E., Raman, P., Shipway, A., Engels, I. H., Cheng, J., Yu, G. K., Yu, J., Aspesi, P., de Silva, M., Jagtap, K., Jones, M. D., Wang, L., Hatton, C., Palescandolo, E., Gupta, S., Mahan, S., Sougnez, C., Onofrio, R. C., Liefeld, T., MacConaill, L., Winckler, W., Reich, M., Li, N., Mesirov, J. P., Gabriel, S. B., Getz, G., Ardlie, K., Chan, V., Myer, V. E., Weber, B. L., Porter, J., Warmuth, M., Finan, P., Harris, J. L., Meyerson, M., Golub, T. R., Morrissey, M. P., Sellers, W. R., Schlegel, R., and Garraway, L. A. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 483, 603–607 Ray, B., Henaff, M., Ma, S., Efstathiadis, E., Peskin, E. R., Picone, M., Poli, T., Aliferis, C. F., and Statnikov, A. (2014) Information content and analysis methods for multi-modal high-throughput biomedical data. Sci. Rep. 4, 4411 Ma, S., Ren, J., and Fenyö, D. (2016) Breast Cancer Prognostics Using Multi-Omics Data. AMIA Summits Transl. Sci. Proc. 2016, 52–59 Menschaert, G., and Fenyö, D. (2015) Proteogenomics from a bioinformatics angle: A growing field. Mass Spectrom. Rev. 10.1002/mas.21483 Nesvizhskii, A. I. (2014) Proteogenomics: concepts, applications and computational strategies. Nat. Methods. 11, 1114–1125 Wang, X., Liu, Q., and Zhang, B. (2014) Leveraging the complementary nature of RNA-Seq and shotgun proteomics data. Proteomics. 14, 2676–2687 Wang, X., and Zhang, B. (2014) Integrating genomic, transcriptomic, and interactome data to improve Peptide and protein identification in shotgun proteomics. J. Proteome Res. 13, 2715–2723 Aebersold, R., and Mann, M. (2016) Mass-spectrometric exploration of proteome structure and function. Nature. 537, 347–355

37

26. Medzihradszky, K. F., and Chalkley, R. J. (2015) Lessons in de novo peptide sequencing by tandem mass spectrometry. Mass Spectrom. Rev. 34, 43–63 27. Yan, Y., Kusalik, A. J., and Wu, F.-X. (2015) Recent Developments in Computational Methods for De Novo Peptide Sequencing from Tandem Mass Spectrometry (MS/MS). Protein Pept. Lett. 22, 983–991 28. Clauser, K. R., Baker, P., and Burlingame, A. L. (1999) Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem. 71, 2871–2882 29. Castellana, N., and Bafna, V. (2010) Proteogenomics to discover the full coding content of genomes: a computational perspective. J. Proteomics. 73, 2124–2135 30. Sheynkman, G. M., Shortreed, M. R., Cesnik, A. J., and Smith, L. M. (2016) Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu. Rev. Anal. Chem. Palo Alto Calif. 9, 521–545 31. Krug, K., Nahnsen, S., and Macek, B. (2011) Mass spectrometry at the interface of proteomics and genomics. Mol. Biosyst. 7, 284–291 32. Yates, J. R., Eng, J. K., and McCormack, A. L. (1995) Mining genomes: correlating tandem mass spectra of modified and unmodified peptides to sequences in nucleotide databases. Anal. Chem. 67, 3202–3210 33. Link, A. J., Hays, L. G., Carmack, E. B., and Yates, J. R. (1997) Identifying the major proteome components of Haemophilus influenzae type-strain NCTC 8143. Electrophoresis. 18, 1314–1334 34. Neubauer, G., King, A., Rappsilber, J., Calvio, C., Watson, M., Ajuh, P., Sleeman, J., Lamond, A., and Mann, M. (1998) Mass spectrometry and EST-database searching allows characterization of the multi-protein spliceosome complex. Nat. Genet. 20, 46–50 35. Jungblut, P. R., Müller, E. C., Mattow, J., and Kaufmann, S. H. (2001) Proteomics reveals open reading frames in Mycobacterium tuberculosis H37Rv not predicted by genomics. Infect. Immun. 69, 5905–5907 36. Choudhary, J. S., Blackstock, W. P., Creasy, D. M., and Cottrell, J. S. (2001) Interrogating the human genome using uninterpreted mass spectrometry data. Proteomics. 1, 651–667 37. Merrihew, G. E., Davis, C., Ewing, B., Williams, G., Käll, L., Frewen, B. E., Noble, W. S., Green, P., Thomas, J. H., and MacCoss, M. J. (2008) Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations. Genome Res. 18, 1660–1669 38. Castellana, N. E., Payne, S. H., Shen, Z., Stanke, M., Bafna, V., and Briggs, S. P. (2008) Discovery and revision of Arabidopsis genes by proteogenomics. Proc. Natl. Acad. Sci. U. S. A. 105, 21034–21038 39. Fermin, D., Allen, B. B., Blackwell, T. W., Menon, R., Adamski, M., Xu, Y., Ulintz, P., Omenn, G. S., and States, D. J. (2006) Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 7, R35 40. Gupta, N., Tanner, S., Jaitly, N., Adkins, J. N., Lipton, M., Edwards, R., Romine, M., Osterman, A., Bafna, V., Smith, R. D., and Pevzner, P. A. (2007) Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res. 17, 1362–1377 41. Potgieter, M. G., Nakedi, K. C., Ambler, J. M., Nel, A. J. M., Garnett, S., Soares, N. C., Mulder, N., and Blackburn, J. M. (2016) Proteogenomic Analysis of Mycobacterium smegmatis Using High Resolution Mass Spectrometry. Front. Microbiol. 7, 427 42. Krug, K., Carpy, A., Behrends, G., Matic, K., Soares, N. C., and Macek, B. (2013) Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments. Mol. Cell. Proteomics MCP. 12, 3420–3430

38

43. Borchert, N., Dieterich, C., Krug, K., Schütz, W., Jung, S., Nordheim, A., Sommer, R. J., and Macek, B. (2010) Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models. Genome Res. 20, 837–846 44. Baerenfaller, K., Grossmann, J., Grobei, M. A., Hull, R., Hirsch-Hoffmann, M., Yalovsky, S., Zimmermann, P., Grossniklaus, U., Gruissem, W., and Baginsky, S. (2008) Genomescale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 320, 938–941 45. Gallien, S., Perrodou, E., Carapito, C., Deshayes, C., Reyrat, J.-M., Van Dorsselaer, A., Poch, O., Schaeffer, C., and Lecompte, O. (2009) Ortho-proteogenomics: multiple proteomes investigation through orthology and a new MS-based protocol. Genome Res. 19, 128–135 46. Tanner, S., Shen, Z., Ng, J., Florea, L., Guigó, R., Briggs, S. P., and Bafna, V. (2007) Improving gene annotation using peptide mass spectrometry. Genome Res. 17, 231–239 47. Xia, D., Sanderson, S. J., Jones, A. R., Prieto, J. H., Yates, J. R., Bromley, E., Tomley, F. M., Lal, K., Sinden, R. E., Brunk, B. P., Roos, D. S., and Wastling, J. M. (2008) The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation. Genome Biol. 9, R116 48. Elias, J. E., and Gygi, S. P. (2010) Target-decoy search strategy for mass spectrometrybased proteomics. Methods Mol. Biol. Clifton NJ. 604, 55–71 49. Hu, Q., Noll, R. J., Li, H., Makarov, A., Hardman, M., and Graham Cooks, R. (2005) The Orbitrap: a new mass spectrometer. J. Mass Spectrom. JMS. 40, 430–443 50. Cox, J., and Mann, M. (2007) Is proteomics the new genomics? Cell. 130, 395–398 51. Michalski, A., Damoc, E., Lange, O., Denisov, E., Nolting, D., Müller, M., Viner, R., Schwartz, J., Remes, P., Belford, M., Dunyach, J.-J., Cox, J., Horning, S., Mann, M., and Makarov, A. (2012) Ultra high resolution linear ion trap Orbitrap mass spectrometer (Orbitrap Elite) facilitates top down LC MS/MS and versatile peptide fragmentation modes. Mol. Cell. Proteomics MCP. 11, O111.013698 52. Scheltema, R. A., Hauschild, J.-P., Lange, O., Hornburg, D., Denisov, E., Damoc, E., Kuehn, A., Makarov, A., and Mann, M. (2014) The Q Exactive HF, a Benchtop mass spectrometer with a pre-filter, high-performance quadrupole and an ultra-high-field Orbitrap analyzer. Mol. Cell. Proteomics MCP. 13, 3698–3708 53. Eliuk, S., and Makarov, A. (2015) Evolution of Orbitrap Mass Spectrometry Instrumentation. Annu. Rev. Anal. Chem. Palo Alto Calif. 8, 61–80 54. Tisserant, E., Da Silva, C., Kohler, A., Morin, E., Wincker, P., and Martin, F. (2011) Deep RNA sequencing improved the structural annotation of the Tuber melanosporum transcriptome. New Phytol. 189, 883–891 55. Martin, J., Zhu, W., Passalacqua, K. D., Bergman, N., and Borodovsky, M. (2010) Bacillus anthracis genome organization in light of whole transcriptome sequencing. BMC Bioinformatics. 11 Suppl 3, S10 56. Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M., and Stanke, M. (2016) BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinforma. Oxf. Engl. 32, 767–769 57. Coleman, S. J., Zeng, Z., Wang, K., Luo, S., Khrebtukova, I., Mienaltowski, M. J., Schroth, G. P., Liu, J., and MacLeod, J. N. (2010) Structural annotation of equine protein-coding genes determined by mRNA sequencing. Anim. Genet. 41 Suppl 2, 121–130 58. Alfaro, J. A., Sinha, A., Kislinger, T., and Boutros, P. C. (2014) Onco-proteogenomics: cancer proteomics joins forces with genomics. Nat. Methods. 11, 1107–1113 59. Fan, J., Saha, S., Barker, G., Heesom, K. J., Ghali, F., Jones, A. R., Matthews, D. A., and Bessant, C. (2015) Galaxy Integrated Omics: Web-based Standards-Compliant Workflows for Proteomics Informed by Transcriptomics. Mol. Cell. Proteomics MCP. 14, 3087–3093

39

60. Krug, K., Popic, S., Carpy, A., Taumer, C., and Macek, B. (2014) Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants. Proteomics. 14, 2699–2708 61. Li, Y., Wang, X., Cho, J.-H., Shaw, T. I., Wu, Z., Bai, B., Wang, H., Zhou, S., Beach, T. G., Wu, G., Zhang, J., and Peng, J. (2016) JUMPg: An Integrative Proteogenomics Pipeline Identifying Unannotated Proteins in Human Brain and Cancer Cells. J. Proteome Res. 15, 2309–2320 62. Ruggles, K. V., Tang, Z., Wang, X., Grover, H., Askenazi, M., Teubl, J., Cao, S., McLellan, M. D., Clauser, K. R., Tabb, D. L., Mertins, P., Slebos, R., Erdmann-Gilmore, P., Li, S., Gunawardena, H. P., Xie, L., Liu, T., Zhou, J.-Y., Sun, S., Hoadley, K. A., Perou, C. M., Chen, X., Davies, S. R., Maher, C. A., Kinsinger, C. R., Rodland, K. D., Zhang, H., Zhang, Z., Ding, L., Townsend, R. R., Rodriguez, H., Chan, D., Smith, R. D., Liebler, D. C., Carr, S. A., Payne, S., Ellis, M. J., and Fenyő, D. (2016) An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol. Cell. Proteomics MCP. 15, 1060–1071 63. Wang, X., and Zhang, B. (2013) customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinforma. Oxf. Engl. 29, 3235–3237 64. Wen, B., Xu, S., Zhou, R., Zhang, B., Wang, X., Liu, X., Xu, X., and Liu, S. (2016) PGA: an R/Bioconductor package for identification of novel peptides using a customized database derived from RNA-Seq. BMC Bioinformatics. 17, 244 65. Woo, S., Cha, S. W., Merrihew, G., He, Y., Castellana, N., Guest, C., MacCoss, M., and Bafna, V. (2014) Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res. 13, 21–28 66. Zickmann, F., and Renard, B. Y. (2015) MSProGene: integrative proteogenomics beyond six-frames and single nucleotide polymorphisms. Bioinforma. Oxf. Engl. 31, i106-115 67. Woo, S., Cha, S. W., Bonissone, S., Na, S., Tabb, D. L., Pevzner, P. A., and Bafna, V. (2015) Advanced Proteogenomic Analysis Reveals Multiple Peptide Mutations and Complex Immunoglobulin Peptides in Colon Cancer. J. Proteome Res. 14, 3555–3567 68. Scheid, J. F., Mouquet, H., Ueberheide, B., Diskin, R., Klein, F., Oliveira, T. Y. K., Pietzsch, J., Fenyo, D., Abadir, A., Velinzon, K., Hurley, A., Myung, S., Boulad, F., Poignard, P., Burton, D. R., Pereyra, F., Ho, D. D., Walker, B. D., Seaman, M. S., Bjorkman, P. J., Chait, B. T., and Nussenzweig, M. C. (2011) Sequence and structural convergence of broad and potent HIV antibodies that mimic CD4 binding. Science. 333, 1633–1637 69. Cheung, W. C., Beausoleil, S. A., Zhang, X., Sato, S., Schieferl, S. M., Wieler, J. S., Beaudet, J. G., Ramenani, R. K., Popova, L., Comb, M. J., Rush, J., and Polakiewicz, R. D. (2012) A proteomics approach for the identification and cloning of monoclonal antibodies from serum. Nat. Biotechnol. 30, 447–452 70. Muellenbeck, M. F., Ueberheide, B., Amulic, B., Epp, A., Fenyo, D., Busse, C. E., Esen, M., Theisen, M., Mordmüller, B., and Wardemann, H. (2013) Atypical and classical memory B cells produce Plasmodium falciparum neutralizing antibodies. J. Exp. Med. 210, 389–399 71. Fridy, P. C., Li, Y., Keegan, S., Thompson, M. K., Nudelman, I., Scheid, J. F., Oeffinger, M., Nussenzweig, M. C., Fenyö, D., Chait, B. T., and Rout, M. P. (2014) A robust pipeline for rapid production of versatile nanobody repertoires. Nat. Methods. 11, 1253–1260 72. Roth, D. B., and Craig, N. L. (1998) VDJ Recombination. Cell. 94, 411–414 73. Di Noia, J. M., and Neuberger, M. S. (2007) Molecular mechanisms of antibody somatic hypermutation. Annu. Rev. Biochem. 76, 1–22 74. Guthals, A., Gan, Y., Murray, L., Chen, Y., Stinson, J., Nakamura, G. R., Lill, J. R., Sandoval, W., and Bandeira, N. (2016) De Novo MS/MS Sequencing of Native Human Antibodies. J. Proteome Res. 10.1021/acs.jproteome.6b00608

40

75. Guthals, A., Clauser, K. R., Frank, A. M., and Bandeira, N. (2013) Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J. Proteome Res. 12, 2846–2857 76. Guthals, A., Clauser, K. R., and Bandeira, N. (2012) Shotgun protein sequencing with metacontig assembly. Mol. Cell. Proteomics MCP. 11, 1084–1096 77. Tran, N. H., Rahman, M. Z., He, L., Xin, L., Shan, B., and Li, M. (2016) Complete De Novo Assembly of Monoclonal Antibody Sequences. Sci. Rep. 6, 31730 78. Vincke, C., Loris, R., Saerens, D., Martinez-Rodriguez, S., Muyldermans, S., and Conrath, K. (2009) General strategy to humanize a camelid single-domain antibody and identification of a universal humanized nanobody scaffold. J. Biol. Chem. 284, 3273–3284 79. Arias, J. L., Unciti-Broceta, J. D., Maceira, J., Del Castillo, T., Hernández-Quero, J., Magez, S., Soriano, M., and García-Salcedo, J. A. (2015) Nanobody conjugated PLGA nanoparticles for active targeting of African Trypanosomiasis. J. Control. Release Off. J. Control. Release Soc. 197, 190–198 80. Davis, Z. H., Verschueren, E., Jang, G. M., Kleffman, K., Johnson, J. R., Park, J., Von Dollen, J., Maher, M. C., Johnson, T., Newton, W., Jäger, S., Shales, M., Horner, J., Hernandez, R. D., Krogan, N. J., and Glaunsinger, B. A. (2015) Global mapping of herpesvirus-host protein complexes reveals a transcription strategy for late genes. Mol. Cell. 57, 349–360 81. Jäger, S., Kim, D. Y., Hultquist, J. F., Shindo, K., LaRue, R. S., Kwon, E., Li, M., Anderson, B. D., Yen, L., Stanley, D., Mahon, C., Kane, J., Franks-Skiba, K., Cimermancic, P., Burlingame, A., Sali, A., Craik, C. S., Harris, R. S., Gross, J. D., and Krogan, N. J. (2011) Vif hijacks CBF-β to degrade APOBEC3G and promote HIV-1 infection. Nature. 481, 371–375 82. Jean Beltran, P. M., Mathias, R. A., and Cristea, I. M. (2016) A Portrait of the Human Organelle Proteome In Space and Time during Cytomegalovirus Infection. Cell Syst. 3, 361–373.e6 83. Luo, Y., Jacobs, E. Y., Greco, T. M., Mohammed, K. D., Tong, T., Keegan, S., Binley, J. M., Cristea, I. M., Fenyö, D., Rout, M. P., Chait, B. T., and Muesing, M. A. (2016) HIV-host interactome revealed directly from infected cells. Nat. Microbiol. 1, 16068 84. Crawford, D. H., Johannessen, I., and Rickinson, A. B. (2014) Cancer Virus: The discovery of the Epstein-Barr Virus, 1 edition, Oxford University Press, Oxford, United Kingdom  ; New York 85. Huang, C. R. L., Burns, K. H., and Boeke, J. D. (2012) Active transposition in genomes. Annu. Rev. Genet. 46, 651–675 86. Rodić, N., Sharma, R., Sharma, R., Zampella, J., Dai, L., Taylor, M. S., Hruban, R. H., Iacobuzio-Donahue, C. A., Maitra, A., Torbenson, M. S., Goggins, M., Shih, I.-M., Duffield, A. S., Montgomery, E. A., Gabrielson, E., Netto, G. J., Lotan, T. L., De Marzo, A. M., Westra, W., Binder, Z. A., Orr, B. A., Gallia, G. L., Eberhart, C. G., Boeke, J. D., Harris, C. R., and Burns, K. H. (2014) Long interspersed element-1 protein expression is a hallmark of many human cancers. Am. J. Pathol. 184, 1280–1286 87. Ardeljan, D., Taylor, M. S., Burns, K. H., Boeke, J. D., Espey, M. G., Woodhouse, E. C., and Howcroft, T. K. (2016) Meeting Report: The Role of the Mobilome in Cancer. Cancer Res. 76, 4316–4319 88. Burns, K. H., and Boeke, J. D. (2012) Human transposon tectonics. Cell. 149, 740–752 89. LINE-1 ORF1 Observations in GPMDB http://gpmdb.thegpm.org/protein/accession/gi%7C74753422%7C 90. Banfield, J. F., Verberkmoes, N. C., Hettich, R. L., and Thelen, M. P. (2005) Proteogenomic approaches for the molecular characterization of natural microbial communities. Omics J. Integr. Biol. 9, 301–333

41

91. Wilmes, P., and Bond, P. L. (2004) The application of two-dimensional polyacrylamide gel electrophoresis and downstream analyses to a mixed community of prokaryotic microorganisms. Environ. Microbiol. 6, 911–920 92. Hug, L. A., Baker, B. J., Anantharaman, K., Brown, C. T., Probst, A. J., Castelle, C. J., Butterfield, C. N., Hernsdorf, A. W., Amano, Y., Ise, K., Suzuki, Y., Dudek, N., Relman, D. A., Finstad, K. M., Amundson, R., Thomas, B. C., and Banfield, J. F. (2016) A new view of the tree of life. Nat. Microbiol. 1, 16048 93. Locey, K. J., and Lennon, J. T. (2016) Scaling laws predict global microbial diversity. Proc. Natl. Acad. Sci. U. S. A. 113, 5970–5975 94. Tripp, H. J., Sutton, G., White, O., Wortman, J., Pati, A., Mikhailova, N., Ovchinnikova, G., Payne, S. H., Kyrpides, N. C., and Ivanova, N. (2015) Toward a standard in structural genome annotation for prokaryotes. Stand. Genomic Sci. 10, 45 95. Horlacher, O., Lisacek, F., and Müller, M. (2016) Mining Large Scale Tandem Mass Spectrometry Data for Protein Modifications Using Spectral Libraries. J. Proteome Res. 15, 721–731 96. Na, S., Payne, S. H., and Bandeira, N. (2016) Multi-species Identification of Polymorphic Peptide Variants via Propagation in Spectral Networks. Mol. Cell. Proteomics MCP. 15, 3501–3512 97. Ye, D., Fu, Y., Sun, R.-X., Wang, H.-P., Yuan, Z.-F., Chi, H., and He, S.-M. (2010) Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinforma. Oxf. Engl. 26, i399-406 98. Wilmes, P., Heintz-Buschart, A., and Bond, P. L. (2015) A decade of metaproteomics: where we stand and what the future holds. Proteomics. 15, 3409–3417 99. Gry, M., Rimini, R., Strömberg, S., Asplund, A., Pontén, F., Uhlén, M., and Nilsson, P. (2009) Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics. 10, 365 100. Hundertmark, C., Fischer, R., Reinl, T., May, S., Klawonn, F., and Jänsch, L. (2009) MSspecific noise model reveals the potential of iTRAQ in quantitative proteomics. Bioinforma. Oxf. Engl. 25, 1004–1011 101. Lahens, N. F., Kavakli, I. H., Zhang, R., Hayer, K., Black, M. B., Dueck, H., Pizarro, A., Kim, J., Irizarry, R., Thomas, R. S., Grant, G. R., and Hogenesch, J. B. (2014) IVT-seq reveals extreme bias in RNA sequencing. Genome Biol. 15, R86 102. Jovanovic, M., Rooney, M. S., Mertins, P., Przybylski, D., Chevrier, N., Satija, R., Rodriguez, E. H., Fields, A. P., Schwartz, S., Raychowdhury, R., Mumbach, M. R., Eisenhaure, T., Rabani, M., Gennert, D., Lu, D., Delorey, T., Weissman, J. S., Carr, S. A., Hacohen, N., and Regev, A. (2015) Immunogenetics. Dynamic profiling of the protein life cycle in response to pathogens. Science. 347, 1259038 103. Li, J. J., and Biggin, M. D. (2015) Gene expression. Statistics requantitates the central dogma. Science. 347, 1066–1067 104. Wang, J., Ma, Z., Carr, S. A., Mertins, P., Zhang, H., Zhang, Z., Chan, D. W., Ellis, M. J. C., Townsend, R. R., Smith, R. D., McDermott, J. E., Chen, X., Paulovich, A. G., Boja, E. S., Mesri, M., Kinsinger, C. R., Rodriguez, H., Rodland, K. D., Liebler, D. C., and Zhang, B. (2016) Proteome profiling outperforms transcriptome profiling for co-expression based gene function prediction. Mol. Cell. Proteomics MCP. 10.1074/mcp.M116.060301 105. Gilad, Y., Rifkin, S. A., and Pritchard, J. K. (2008) Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet. TIG. 24, 408–415 106. Nica, A. C., and Dermitzakis, E. T. (2013) Expression quantitative trait loci: present and future. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 368, 20120362 107. Bartel, D. P. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 116, 281–297

42

108. Guo, H., Ingolia, N. T., Weissman, J. S., and Bartel, D. P. (2010) Mammalian microRNAs predominantly act to decrease target mRNA levels. Nature. 466, 835–840 109. Selbach, M., Schwanhäusser, B., Thierfelder, N., Fang, Z., Khanin, R., and Rajewsky, N. (2008) Widespread changes in protein synthesis induced by microRNAs. Nature. 455, 58–63 110. Baek, D., Villén, J., Shin, C., Camargo, F. D., Gygi, S. P., and Bartel, D. P. (2008) The impact of microRNAs on protein output. Nature. 455, 64–71 111. Bush, W. S., and Moore, J. H. (2012) Chapter 11: Genome-wide association studies. PLoS Comput. Biol. 8, e1002822 112. Shastry, B. S. (2007) SNPs in disease gene mapping, medicinal drug development and evolution. J. Hum. Genet. 52, 871–880 113. Erxleben, C., Liao, Y., Gentile, S., Chin, D., Gomez-Alegria, C., Mori, Y., Birnbaumer, L., and Armstrong, D. L. (2006) Cyclosporin and Timothy syndrome increase mode 2 gating of CaV1.2 calcium channels through aberrant phosphorylation of S6 helices. Proc. Natl. Acad. Sci. U. S. A. 103, 3932–3937 114. Gentile, S., Martin, N., Scappini, E., Williams, J., Erxleben, C., and Armstrong, D. L. (2008) The human ERG1 channel polymorphism, K897T, creates a phosphorylation site that inhibits channel activity. Proc. Natl. Acad. Sci. U. S. A. 105, 14704–14708 115. Keegan, S., Cortens, J. P., Beavis, R. C., and Fenyö, D. (2016) g2pDB: A Database Mapping Protein Post-Translational Modifications to Genomic Coordinates. J. Proteome Res. 15, 983–990 116. Yang, C.-Y., Chang, C.-H., Yu, Y.-L., Lin, T.-C. E., Lee, S.-A., Yen, C.-C., Yang, J.-M., Lai, J.-M., Hong, Y.-R., Tseng, T.-L., Chao, K.-M., and Huang, C.-Y. F. (2008) PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database. Bioinforma. Oxf. Engl. 24, i14-20 117. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res. 43, D204-212 118. Dinkel, H., Chica, C., Via, A., Gould, C. M., Jensen, L. J., Gibson, T. J., and Diella, F. (2010) Phospho.ELM: a database of phosphorylation sites—update 2011. Nucleic Acids Res. 10.1093/nar/gkq1104 119. Keshava Prasad, T. S., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A., Balakrishnan, L., Marimuthu, A., Banerjee, S., Somanathan, D. S., Sebastian, A., Rani, S., Ray, S., Harrys Kishore, C. J., Kanth, S., Ahmed, M., Kashyap, M. K., Mohmood, R., Ramachandra, Y. L., Krishna, V., Rahiman, B. A., Mohan, S., Ranganathan, P., Ramabadran, S., Chaerkady, R., and Pandey, A. (2009) Human Protein Reference Database--2009 update. Nucleic Acids Res. 37, D767-772 120. Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., and Sirotkin, K. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 121. Reimand, J., and Bader, G. D. (2013) Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol. 9, 637 122. Hornbeck, P. V., Zhang, B., Murray, B., Kornhauser, J. M., Latham, V., and Skrzypek, E. (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512-520 123. Craig, R., Cortens, J. P., and Beavis, R. C. (2004) Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 3, 1234–1242 124. Fenyö, D., and Beavis, R. C. (2015) The GPMDB REST interface. Bioinforma. Oxf. Engl. 31, 2056–2058

43

125. Monti, S., Tamayo, P., Mesirov, J., and Golub, T. (2003) Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Mach. Learn. 52, 91–118 126. Rousseeuw, P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 127. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A. 98, 5116–5121 128. Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genomewide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550 129. Barbie, D. A., Tamayo, P., Boehm, J. S., Kim, S. Y., Moody, S. E., Dunn, I. F., Schinzel, A. C., Sandy, P., Meylan, E., Scholl, C., Fröhling, S., Chan, E. M., Sos, M. L., Michel, K., Mermel, C., Silver, S. J., Weir, B. A., Reiling, J. H., Sheng, Q., Gupta, P. B., Wadlow, R. C., Le, H., Hoersch, S., Wittner, B. S., Ramaswamy, S., Livingston, D. M., Sabatini, D. M., Meyerson, M., Thomas, R. K., Lander, E. S., Mesirov, J. P., Root, D. E., Gilliland, D. G., Jacks, T., and Hahn, W. C. (2009) Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 462, 108–112 130. Daniel, W. W. (1990) Spearman rank correlation coefficient. in Applied Nonparametric Statistics, 2nd Ed., pp. 358–365, PWS-Kent, Boston 131. Wang, D., and Gu, J. (2016) Integrative clustering methods of multi-omics data for molecule-based cancer classifications. Quant. Biol. 4, 58–67 132. Mo, Q., Wang, S., Seshan, V. E., Olshen, A. B., Schultz, N., Sander, C., Powers, R. S., Ladanyi, M., and Shen, R. (2013) Pattern discovery and cancer gene identification in integrated cancer genomic data. Proc. Natl. Acad. Sci. U. S. A. 110, 4245–4250 133. Wu, D., Wang, D., Zhang, M. Q., and Gu, J. (2015) Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification. BMC Genomics. 16, 1022 134. Meng, C., Helm, D., Frejno, M., and Kuster, B. (2016) moCluster: Identifying Joint Patterns Across Multiple Omics Data Sets. J. Proteome Res. 15, 755–765 135. Hoadley, K. A., Yau, C., Wolf, D. M., Cherniack, A. D., Tamborero, D., Ng, S., Leiserson, M. D. M., Niu, B., McLellan, M. D., Uzunangelov, V., Zhang, J., Kandoth, C., Akbani, R., Shen, H., Omberg, L., Chu, A., Margolin, A. A., Van’t Veer, L. J., Lopez-Bigas, N., Laird, P. W., Raphael, B. J., Ding, L., Robertson, A. G., Byers, L. A., Mills, G. B., Weinstein, J. N., Van Waes, C., Chen, Z., Collisson, E. A., Cancer Genome Atlas Research Network, Benz, C. C., Perou, C. M., and Stuart, J. M. (2014) Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 158, 929– 944 136. Wang, B., Mezlini, A. M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., and Goldenberg, A. (2014) Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods. 11, 333–337 137. Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., Haussler, D., and Stuart, J. M. (2010) Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinforma. Oxf. Engl. 26, i237-245 138. Petralia, F., Wang, P., Yang, J., and Tu, Z. (2015) Integrative random forest for gene regulatory network inference. Bioinforma. Oxf. Engl. 31, i197-205 139. Bersanelli, M., Mosca, E., Remondini, D., Giampieri, E., Sala, C., Castellani, G., and Milanesi, L. (2016) Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 17 Suppl 2, 15 140. Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002) Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn. 46, 389–422

44

141. Ma, J., Stingo, F. C., and Hobbs, B. P. (2016) Bayesian predictive modeling for genomic based personalized treatment selection. Biometrics. 72, 575–583 142. Breiman, L. (2001) Random Forests. Mach. Learn. 45, 5–32 143. Janssens, A. C. J. W., Aulchenko, Y. S., Elefante, S., Borsboom, G. J. J. M., Steyerberg, E. W., and van Duijn, C. M. (2006) Predictive testing for complex diseases using multiple genes: fact or fiction? Genet. Med. Off. J. Am. Coll. Med. Genet. 8, 395–400 144. Daemen, A., Griffith, O. L., Heiser, L. M., Wang, N. J., Enache, O. M., Sanborn, Z., Pepin, F., Durinck, S., Korkola, J. E., Griffith, M., Hur, J. S., Huh, N., Chung, J., Cope, L., Fackler, M. J., Umbricht, C., Sukumar, S., Seth, P., Sukhatme, V. P., Jakkula, L. R., Lu, Y., Mills, G. B., Cho, R. J., Collisson, E. A., van’t Veer, L. J., Spellman, P. T., and Gray, J. W. (2013) Modeling precision treatment of breast cancer. Genome Biol. 14, R110 145. Garnett, M. J., Edelman, E. J., Heidorn, S. J., Greenman, C. D., Dastur, A., Lau, K. W., Greninger, P., Thompson, I. R., Luo, X., Soares, J., Liu, Q., Iorio, F., Surdez, D., Chen, L., Milano, R. J., Bignell, G. R., Tam, A. T., Davies, H., Stevenson, J. A., Barthorpe, S., Lutz, S. R., Kogera, F., Lawrence, K., McLaren-Douglas, A., Mitropoulos, X., Mironenko, T., Thi, H., Richardson, L., Zhou, W., Jewitt, F., Zhang, T., O’Brien, P., Boisvert, J. L., Price, S., Hur, W., Yang, W., Deng, X., Butler, A., Choi, H. G., Chang, J. W., Baselga, J., Stamenkovic, I., Engelman, J. A., Sharma, S. V., Delattre, O., Saez-Rodriguez, J., Gray, N. S., Settleman, J., Futreal, P. A., Haber, D. A., Stratton, M. R., Ramaswamy, S., McDermott, U., and Benes, C. H. (2012) Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature. 483, 570–575 146. Sos, M. L., Michel, K., Zander, T., Weiss, J., Frommolt, P., Peifer, M., Li, D., Ullrich, R., Koker, M., Fischer, F., Shimamura, T., Rauh, D., Mermel, C., Fischer, S., Stückrath, I., Heynck, S., Beroukhim, R., Lin, W., Winckler, W., Shah, K., LaFramboise, T., Moriarty, W. F., Hanna, M., Tolosi, L., Rahnenführer, J., Verhaak, R., Chiang, D., Getz, G., Hellmich, M., Wolf, J., Girard, L., Peyton, M., Weir, B. A., Chen, T.-H., Greulich, H., Barretina, J., Shapiro, G. I., Garraway, L. A., Gazdar, A. F., Minna, J. D., Meyerson, M., Wong, K.-K., and Thomas, R. K. (2009) Predicting drug susceptibility of non-small cell lung cancers based on genetic lesions. J. Clin. Invest. 119, 1727–1740 147. Deeb, S. J., Tyanova, S., Hummel, M., Schmidt-Supprian, M., Cox, J., and Mann, M. (2015) Machine Learning-based Classification of Diffuse Large B-cell Lymphoma Patients by Their Protein Expression Profiles. Mol. Cell. Proteomics MCP. 14, 2947–2960 148. Tyanova, S., Albrechtsen, R., Kronqvist, P., Cox, J., Mann, M., and Geiger, T. (2016) Proteomic maps of breast cancer subtypes. Nat. Commun. 7, 10259 149. Iglesias-Gato, D., Wikström, P., Tyanova, S., Lavallee, C., Thysell, E., Carlsson, J., Hägglöf, C., Cox, J., Andrén, O., Stattin, P., Egevad, L., Widmark, A., Bjartell, A., Collins, C. C., Bergh, A., Geiger, T., Mann, M., and Flores-Morales, A. (2016) The Proteome of Primary Prostate Cancer. Eur. Urol. 69, 942–952 150. Gonzalez-Angulo, A. M., Hennessy, B. T., Meric-Bernstam, F., Sahin, A., Liu, W., Ju, Z., Carey, M. S., Myhre, S., Speers, C., Deng, L., Broaddus, R., Lluch, A., Aparicio, S., Brown, P., Pusztai, L., Symmans, W. F., Alsner, J., Overgaard, J., Borresen-Dale, A.-L., Hortobagyi, G. N., Coombes, K. R., and Mills, G. B. (2011) Functional proteomics can define prognosis and predict pathologic complete response in patients with breast cancer. Clin. Proteomics. 8, 11 151. Niepel, M., Hafner, M., Pace, E. A., Chung, M., Chai, D. H., Zhou, L., Schoeberl, B., and Sorger, P. K. (2013) Profiles of Basal and stimulated receptor signaling networks predict drug response in breast cancer lines. Sci. Signal. 6, ra84 152. Timpe, L. C., Li, D., Yen, T.-Y., Wong, J., Yen, R., Macher, B. A., and Piryatinska, A. (2015) Mining the Breast Cancer Proteome for Predictors of Drug Sensitivity. J. Proteomics Bioinform. 8, 204–211

45

153. Smyth, G. K. (2004) Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat. Appl. Genet. Mol. Biol. 3, 1–25 154. Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.-C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P., and Moreau, Y. (2006) Gene prioritization through genomic data fusion. Nat. Biotechnol. 24, 537–544 155. Kolde, R., Laur, S., Adler, P., and Vilo, J. (2012) Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 28, 573–580 156. Beadle, G. W., and Tatum, E. L. (1941) Genetic Control of Biochemical Reactions in Neurospora. Proc. Natl. Acad. Sci. U. S. A. 27, 499–506 157. Bensimon, A., Heck, A. J. R., and Aebersold, R. (2012) Mass Spectrometry–Based Proteomics and Network Biology. Annu. Rev. Biochem. 81, 379–405 158. Vidal, M., Cusick, M. E., and Barabási, A.-L. (2011) Interactome networks and human disease. Cell. 144, 986–998 159. Arkin, A. P., and Schaffer, D. V. (2011) Network news: innovations in 21st century systems biology. Cell. 144, 844–849 160. Ideker, T., and Krogan, N. J. (2012) Differential network biology. Mol. Syst. Biol. 8, 565 161. Hu, J. X., Thomas, C. E., and Brunak, S. (2016) Network biology concepts in complex disease comorbidities. Nat. Rev. Genet. 10.1038/nrg.2016.87 162. Vidal, M. (2001) A biological atlas of functional maps. Cell. 104, 333–339 163. Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., Lane, H. C., and Lempicki, R. A. (2003) DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, P3 164. Wang, J., Duncan, D., Shi, Z., and Zhang, B. (2013) WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Res. 41, W77-83 165. Daniel, W. W. (1990) Kolmogorov-Smirnov one-sample test. in Applied Nonparametric Statistics, 2nd Ed., pp. 319–330, PWS-Keng, Boston 166. Tarca, A. L., Draghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., Kim, C. J., Kusanovic, J. P., and Romero, R. (2009) A novel signaling pathway impact analysis. Bioinforma. Oxf. Engl. 25, 75–82 167. Weidner, C., Fischer, C., and Sauer, S. (2014) PHOXTRACK-a tool for interpreting comprehensive datasets of post-translational modifications of proteins. Bioinforma. Oxf. Engl. 30, 3410–3411 168. Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A., Jacobsen, A., Byrne, C. J., Heuer, M. L., Larsson, E., Antipin, Y., Reva, B., Goldberg, A. P., Sander, C., and Schultz, N. (2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404 169. Aken, B. L., Achuthan, P., Akanni, W., Amode, M. R., Bernsdorff, F., Bhai, J., Billis, K., Carvalho-Silva, D., Cummins, C., Clapham, P., Gil, L., Girón, C. G., Gordon, L., Hourlier, T., Hunt, S. E., Janacek, S. H., Juettemann, T., Keenan, S., Laird, M. R., Lavidas, I., Maurel, T., McLaren, W., Moore, B., Murphy, D. N., Nag, R., Newman, V., Nuhn, M., Ong, C. K., Parker, A., Patricio, M., Riat, H. S., Sheppard, D., Sparrow, H., Taylor, K., Thormann, A., Vullo, A., Walts, B., Wilder, S. P., Zadissa, A., Kostadima, M., Martin, F. J., Muffato, M., Perry, E., Ruffier, M., Staines, D. M., Trevanion, S. J., Cunningham, F., Yates, A., Zerbino, D. R., and Flicek, P. (2017) Ensembl 2017. Nucleic Acids Res. 45, D635–D642 170. Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., and Mesirov, J. P. (2011) Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 171. Kalume, D. E., Peri, S., Reddy, R., Zhong, J., Okulate, M., Kumar, N., and Pandey, A. (2005) Genome annotation of Anopheles gambiae using mass spectrometry-derived data. BMC Genomics. 6, 128

46

172. Ferro, M., Tardif, M., Reguer, E., Cahuzac, R., Bruley, C., Vermat, T., Nugues, E., Vigouroux, M., Vandenbrouck, Y., Garin, J., and Viari, A. (2008) PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J. Proteome Res. 7, 1873–1883 173. Sanders, W. S., Wang, N., Bridges, S. M., Malone, B. M., Dandass, Y. S., McCarthy, F. M., Nanduri, B., Lawrence, M. L., and Burgess, S. C. (2011) The proteogenomic mapping tool. BMC Bioinformatics. 12, 115 174. Desiere, F., Deutsch, E. W., Nesvizhskii, A. I., Mallick, P., King, N. L., Eng, J. K., Aderem, A., Boyle, R., Brunner, E., Donohoe, S., Fausto, N., Hafen, E., Hood, L., Katze, M. G., Kennedy, K. A., Kregenow, F., Lee, H., Lin, B., Martin, D., Ranish, J. A., Rawlings, D. J., Samelson, L. E., Shiio, Y., Watts, J. D., Wollscheid, B., Wright, M. E., Yan, W., Yang, L., Yi, E. C., Zhang, H., and Aebersold, R. (2005) Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 6, R9 175. Kuhring, M., and Renard, B. Y. (2012) iPiG: integrating peptide spectrum matches into genome browser visualizations. PloS One. 7, e50246 176. Pang, C. N. I., Tay, A. P., Aya, C., Twine, N. A., Harkness, L., Hart-Smith, G., Chia, S. Z., Chen, Z., Deshpande, N. P., Kaakoush, N. O., Mitchell, H. M., Kassem, M., and Wilkins, M. R. (2014) Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J. Proteome Res. 13, 84–98 177. Wang, X., Slebos, R. J. C., Chambers, M. C., Tabb, D. L., Liebler, D. C., and Zhang, B. (2016) proBAMsuite, a Bioinformatics Framework for Genome-Based Representation and Analysis of Proteomics Data. Mol. Cell. Proteomics MCP. 15, 1164–1175 178. Askenazi, M., Ruggles, K. V., and Fenyö, D. (2016) PGx: Putting Peptides to BED. J. Proteome Res. 15, 795–799 179. Guo, F., Wang, D., Liu, Z., Lu, L., Zhang, W., Sun, H., Zhang, H., Ma, J., Wu, S., Li, N., Jiang, Y., Zhu, W., Qin, J., Xu, P., Li, D., and He, F. (2013) CAPER: a chromosomeassembled human proteome browsER. J. Proteome Res. 12, 179–186 180. Nagaraj, S. H., Waddell, N., Madugundu, A. K., Wood, S., Jones, A., Mandyam, R. A., Nones, K., Pearson, J. V., and Grimmond, S. M. (2015) PGTools: A Software Suite for Proteogenomic Data Analysis and Visualization. J. Proteome Res. 14, 2255–2266 181. Ghali, F., Krishna, R., Perkins, S., Collins, A., Xia, D., Wastling, J., and Jones, A. R. (2014) ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards. Proteomics. 14, 2731–2741 182. Halvey, P. J., Wang, X., Wang, J., Bhat, A. A., Dhawan, P., Li, M., Zhang, B., Liebler, D. C., and Slebos, R. J. C. (2014) Proteogenomic analysis reveals unanticipated adaptations of colorectal tumor cells to deficiencies in DNA mismatch repair. Cancer Res. 74, 387– 397 183. Gholami, A. M., Hahne, H., Wu, Z., Auer, F. J., Meng, C., Wilhelm, M., and Kuster, B. (2013) Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 4, 609–620 184. Peterson, E. S., McCue, L. A., Schrimpe-Rutledge, A. C., Jensen, J. L., Walker, H., Kobold, M. A., Webb, S. R., Payne, S. H., Ansong, C., Adkins, J. N., Cannon, W. R., and Webb-Robertson, B.-J. M. (2012) VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC Genomics. 13, 131 185. Askenazi, M., and Fenyö, D. (2016) OpenSlice: Quantitative data sharing from HyperPeaks to global ion chromatograms (GICs). Proteomics. 16, 2495–2501 186. Gehlenborg, N., O’Donoghue, S. I., Baliga, N. S., Goesmann, A., Hibbs, M. A., Kitano, H., Kohlbacher, O., Neuweger, H., Schneider, R., Tenenbaum, D., and Gavin, A.-C. (2010) Visualization of omics data for systems biology. Nat. Methods. 7, S56-68

47

187. Smoot, M. E., Ono, K., Ruscheinski, J., Wang, P.-L., and Ideker, T. (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinforma. Oxf. Engl. 27, 431–432 188. Hu, Z., Hung, J.-H., Wang, Y., Chang, Y.-C., Huang, C.-L., Huyck, M., and DeLisi, C. (2009) VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37, W115-121 189. Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., Hanspers, K., Isserlin, R., Kelley, R., Killcoyne, S., Lotia, S., Maere, S., Morris, J., Ono, K., Pavlovic, V., Pico, A. R., Vailaya, A., Wang, P.-L., Adler, A., Conklin, B. R., Hood, L., Kuiper, M., Sander, C., Schmulevich, I., Schwikowski, B., Warner, G. J., Ideker, T., and Bader, G. D. (2007) Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2, 2366– 2382 190. Barsky, A., Munzner, T., Gardy, J., and Kincaid, R. (2008) Cerebral: visualizing multiple experimental conditions on a graph with biological context. IEEE Trans. Vis. Comput. Graph. 14, 1253–1260 191. Kincaid, R., Kuchinsky, A., and Creech, M. (2008) VistaClara: an expression browser plug-in for Cytoscape. Bioinforma. Oxf. Engl. 24, 2112–2114 192. Bourqui, R., and Westenberg, M. A. (2009) Visualizing Temporal Dynamics at the Genomic and Metabolic Level (Banissi, E., Stuart, L., Wyeld, T. G., Jern, M., Andrienko, G., Memon, N., Alhajj, R., Burkhard, R. A., Grinstein, G., Groth, D., Ursyn, A., Johansson, J., Forsell, C., Cvek, U., Trutschl, M., Marchese, F. T., Maple, C., Cowell, A. J., and Moere, A. V. eds), Ieee Computer Soc, Los Alamitos 193. Rohn, H., Junker, A., Hartmann, A., Grafahrend-Belau, E., Treutler, H., Klapperstück, M., Czauderna, T., Klukas, C., and Schreiber, F. (2012) VANTED v2: a framework for systems biology applications. BMC Syst. Biol. 6, 139 194. Schreiber, F., Dwyer, T., Marriott, K., and Wybrow, M. (2009) A generic algorithm for layout of biological networks. BMC Bioinformatics. 10, 375 195. Pavlopoulos, G. A., O’Donoghue, S. I., Satagopam, V. P., Soldatos, T. G., Pafilis, E., and Schneider, R. (2008) Arena3D: visualization of biological networks in 3D. BMC Syst. Biol. 2, 104 196. Shi, Z., Wang, J., and Zhang, B. (2013) NetGestalt: integrating multidimensional omics data over biological networks. Nat. Methods. 10, 597–598 197. Turinsky, A. L., Razick, S., Turner, B., Donaldson, I. M., and Wodak, S. J. (2011) Interaction databases on the same page. Nat. Biotechnol. 29, 391–393 198. Salavert, F., García-Alonso, L., Sánchez, R., Alonso, R., Bleda, M., Medina, I., and Dopazo, J. (2016) Web-based network analysis and visualization using CellMaps. Bioinforma. Oxf. Engl. 10.1093/bioinformatics/btw332 199. Pratt, D., Chen, J., Welker, D., Rivas, R., Pillich, R., Rynkov, V., Ono, K., Miello, C., Hicks, L., Szalma, S., Stojmirovic, A., Dobrin, R., Braxenthaler, M., Kuentzer, J., Demchak, B., and Ideker, T. (2015) NDEx, the Network Data Exchange. Cell Syst. 1, 302–305 200. Eriksson, J., and Fenyö, D. (2007) Improving the success rate of proteome analysis by modeling protein-abundance distributions and experimental designs. Nat. Biotechnol. 25, 651–655 201. Sheynkman, G. M., Johnson, J. E., Jagtap, P. D., Shortreed, M. R., Onsongo, G., Frey, B. L., Griffin, T. J., and Smith, L. M. (2014) Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC Genomics. 15, 703 202. Krasnov, G. S., Dmitriev, A. A., Kudryavtseva, A. V., Shargunov, A. V., Karpov, D. S., Uroshlev, L. A., Melnikova, N. V., Blinov, V. M., Poverennaya, E. V., Archakov, A. I., Lisitsa, A. V., and Ponomarenko, E. A. (2015) PPLine: An Automated Pipeline for SNP,

48

203. 204.

205.

206. 207.

SAP, and Splice Variant Detection in the Context of Proteogenomics. J. Proteome Res. 14, 3729–3737 Wagih, O., Reimand, J., and Bader, G. D. (2015) MIMP: predicting the impact of mutations on kinase-substrate phosphorylation. Nat. Methods. 12, 531–533 Zeeberg, B. R., Feng, W., Wang, G., Wang, M. D., Fojo, A. T., Sunshine, M., Narasimhan, S., Kane, D. W., Reinhold, W. C., Lababidi, S., Bussey, K. J., Riss, J., Barrett, J. C., and Weinstein, J. N. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 4, R28 Lynn, D. J., Winsor, G. L., Chan, C., Richard, N., Laird, M. R., Barsky, A., Gardy, J. L., Roche, F. M., Chan, T. H. W., Shah, N., Lo, R., Naseer, M., Que, J., Yau, M., Acab, M., Tulpan, D., Whiteside, M. D., Chikatamarla, A., Mah, B., Munzner, T., Hokamp, K., Hancock, R. E. W., and Brinkman, F. S. L. (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol. Syst. Biol. 4, 218 Okuda, S., Yamada, T., Hamajima, M., Itoh, M., Katayama, T., Bork, P., Goto, S., and Kanehisa, M. (2008) KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res. 36, W423-426 Ma’ayan, A., Rouillard, A. D., Clark, N. R., Wang, Z., Duan, Q., and Kou, Y. (2014) Lean Big Data Integration in Systems Biology and Systems Pharmacology. Trends Pharmacol. Sci. 35, 450–460

49

FIGURE LEGENDS:

Figure 1. Sequence-centric proteogenomics. Sequencing-based technologies to sequence DNA (whole genome sequencing, WGS; whole exome sequencing, WXS) and RNA (RNA-seq) generate millions of short sequencing reads that are assembled into genomes, exomes or transcriptomes by either de-novo or template-based approaches by alignment to a reference sequence. Sample-specific sequence aberrations are determined and nucleotide sequences are transformed into personalized, amino acid-centric sequence databases. Peptide mass spectra derived by LC-MS/MS analysis from a matching sample are then scored and validated against the personalized database enabling the detection of sample-specific peptide sequences. Depending on the scope of the proteogenomic project, these peptides can then be used to 1) aid genome annotation by detection of peptides in unannotated genome regions; 2) identify tumor-specific mutations translated into the proteome as well as novel protein splice variants; and 3) detect species-specific peptides and in microbial communities.

Figure 2. Proteogenomic relationships. (A) Correlation analysis of mRNA and protein pairs across samples enables the assessment of global correlation structure which typically centers between correlation coefficients of 0.3 and 0.5. (B) Regulatory effects on RNA and protein expression levels caused by copy number aberrations (CNA), genetic variants (eQTL) and microRNAs (miRNAs) can be studied by different correlation-based approaches. CNA cis and trans effects on RNA, protein and PTM expression can be determined by correlating each gene copy number at a given locus to all quantified features in RNA, protein or PTM space across all samples. Expression quantitative trait loci (eQTL) analysis can be used to identify DNA sequence variants affecting RNA/protein expression levels in the sample population being studied. Global miRNA analysis accompanied with mRNA or protein profiling enables the assessment of miRNA mediated regulation of mRNA and protein expression. (C) Integrative 50

analysis of genetic variants and PTM sites like phosphorylation can identify functional consequences of genetic variants at the molecular level. Mutations that directly affect serine, threonine and tyrosine residues can result in destruction or genesis of phosphosites (I); mutations adjacent to phosphosites can result in removal or addition of phosphosites (II) or change the kinase that recognizes the phosphorylation site (III).

Figure 3. Integrative modeling. Overview of sub-topics in integrative modeling of proteogenomic data. (A) Clustering techniques illustrating a schematic of multi-omic hierarchical clustering analysis resulting in the identification of two subtypes, (B) Predictive modeling for disease diagnosis, prognosis, drug response and drug toxicity using multiple data modalities and, (C) proteogenomic pathway and network modeling, including informing network composition and pathway and GO term enrichment.

Figure 4. Genome-based visualization, using proBAM as an example. proBAM is a data format to integrate mass spectrometry data with the genome. In this example, we show the visualization of 10 colorectal cancer cell lines in proBAM format. The PSMs result from a search against a customized database built from matched RNA-Seq data, which are also incorporated into the visualization. (A) Integrative Genomics Viewer (IGV) snapshot visualizes peptides and RNA-Seq reads mapped to KRAS in one window. The upper panel shows proteomic data from 10 colon cancer cell lines indicated by different colors. The bottom three panels illustrate RNA-Seq data from cell lines HCT15, Caco-2 and SW480, respectively. (B) Zoomed-in view of an exon region in KRAS. Similar to RNA-Seq reads (three bottom panels), peptides mapped to the genome can be classified into within exon peptides and junction peptides in the proBAM file (upper panel). (C) The upper panel shows a zoomed-in view of mutations confirmed by both RNA-Seq and proteomic data in KRAS. A G13D mutation in HCT15 and a G12V mutation in SW480 are observed in both transcriptomic (second and

51

fourth panel) and proteomics (first panel) data, whereas wild type peptide is observed in Caco-2 (third panel).

Figure 5. NetGestalt-based analysis of colorectal cancer proteomics data. (a) NetGestalt created a one-dimensional (1-D), linear order of all 12,112 genes in a protein-protein interaction network based on the hierarchical modular organization of the network. The ruler indicates coordinates of the genes in the resulted linear order. (b) Each bar represents a module identified from the network and alternating bar colors (green and orange) are used to distinguish neighboring modules. (c) Colorectal cancer proteomics data visualized as a heat map with each row representing a sample. Red and blue colors in the heat map represent relative over- and under-expression, respectively. All 12,112 genes in the network are visualized in the heat map, and missing values in the proteomic data are indicated in grey color. The samples (rows) of the data are ordered based on the five subtypes visualized beside the heat map. (d) Signed minus log10 transformed p values of the difference between subtype 3 and other subtypes visualized by the bar plot. Missing values are indicated by yellow bars in the bar plot. (e) Significantly overexpressed genes in subtype 3 (FDR