DNA footprinting: 61 .... Figure S1: Network inference is sensitive to RNA half-life and the number of ... âTranslationâ refers to the entire gene cluster enriched in translation, ... Activity (TFA) and to run Bayesian Best Subset Regression (BBSR). .... Selecting the set of TFs that regulate a specific gene involves several steps and ...
Cell Reports, Volume 23
Supplemental Information
Condition-Specific Modeling of Biophysical Parameters Advances Inference of Regulatory Networks Konstantine Tchourine, Christine Vogel, and Richard Bonneau
S1
Supplemental Tables
Table S1: A survey of the databases of regulatory interactions used in this study. Related to Experimental Procedures Section. The first column denotes the name of the interactions database. The second column lists the number of interactions in each database. An interaction may be duplicated within a database if the database contains multiple entries for the same interaction (e.g. coming from different labs or different conditions), so in the parentheses we specify the number of unique interactions for each database. The third column lists the most common types of regulatory evidence in each database, including the number of interactions for each type of evidence. These top types of evidence account for over 98% of interactions in each database. Database
Total number of interactions (unique)
YeastractDirect
55,653 (41,654)
YeastractIndirect
194,083 (172,929)
SGD
33,838 (32,338)
Kemmeren
193,253 (193,253)
Top types of evidence for interactions ChIP (Chip: 47,535, Seq: 2,302, unspecified: 5382), EMSA: 267, DNA footprinting: 61 Microarray expression level (WT vs. TF mutant: 167,288, WT vs. TF overexpression: 15,916, WT vs. TF loss of function mutant: 3,711, WT vs. TF chimera: 1,511), Northern Blot – WT vs. TF mutant: 1,406, Proteomics – WT vs. TF mutant: 993, lacZ – WT vs. TF mutant: 875 ChIP (Chip: 16,638, Seq: 154, unspecified: 11), Microarray expression level: 13,165, computational combinatorial: 3,782 Microarray expression level – WT vs. TF mutant: 193,253
Table S2: Expression data bi-clustering, RNA half-life fitting, and a high-quality gold standard underlie InfereCLaDR’s superior performance. Related to Table 1. Each modification independently outperforms the original Inferelator using the GS (second column). ”Inferelator + MacIsaac” shows the results of the Inferelator when the MacIsaac standard of interactions was used for training and GS for evaluation. Combining all modifications improves performance as compared to using either one of them separately (third column). Columns two and three show the number of times the AUROC measured on GS-fit from the same re-sample was higher for one method than the other, given a pair of Experimental Procedures specified by the row and the column. The fourth column shows the median area under the receiver operating characteristic curve (AUROC). See Experimental Procedures 4.5, 4.6, and Suppl. Experimental Procedures S3.10 for further details. Method InfereCLaDR (GS + Clustering + RNA half-life) Inferelator + GS + Clustering Inferelator + GS + RNA half-life Inferelator + GS Inferelator + MacIsaac Genie3 + GS iRafNet + GS
Re-samples outperforming Inferelator + GS
Re-samples outperformed by InfereCLaDR
Median AUROC
20/20
-
0.775
20/20
20/20
0.745
19/20
19/20
0.746
0/20 0/20 0/20
20/20 20/20 20/20 20/20
0.718 0.615 0.573 0.555
27
Table S3: Top 10 new predicted targets for various TFs. Related to Table 2. The top half of the table lists the 10 TFs that have the highest number of targets (outdegree) in the Gold Standard. For each TF, the number of known targets in the GS is shown in the parentheses. The table then lists for each TF the top 10 predicted interactions that are not in the GS and therefore most likely new. The bottom half displays the median 10 TFs in terms of the number of known targets, out of the 97 TFs with at least one target in the GS. For each TF, targets are listed in the order of decreasing prediction confidence. Superscripts denote whether a given interaction was also present in any of the four large-scale collections of interactions: Yeastract-Direct (1), Yeastract-Indirect (2), SGD (3), and Kemmeren et al. 2014 (4). Bold targets correspond to interactions that are seen in at most one of these four data bases, highlighting completely new interactions. Table 2 displays the precision-based confidence ranks of these interactions, and SuppNetwork3.tsv in Supplemental File SuppData1.zip lists precision and orthogonal validation scores for all 97 TFs in the GS. TF Name
Targets
Rap1p (192)
SEC142 , YEL11 , PMT42,3 , FEN12,3 , ALG72,3 , RBD2, RPL18B1,2,3 , RPL16B1,2,3 , YLR412CA1 , HXK22,3 LYS11,2,3 , HOM22 , BAT12 , HIS22 , STR22 , SDT12 , RIB32 , PSF21,2,3 , POS52 , SRY12 RPS24A1,2,3 , RPL18B1,2,3,4 , RPS16A1,2,3 , RPL8A2,3 , RPS18B1,2,3 , RPS18A2,3 , RPL40B2,3 , RPL42B2,3 , RPS22A2,3,4 , RPL9A2,3 YDR391C2,3,4 , CMK22,3 , YLR257W2 , STF22,4 , RCN22 , HAL51,2 , OXR12 , PNC12,4 , DOA11 , MRP82 YRO22 , UIP2 , YNL194C1,2,4 , GAD12 , JIP42,4 , MSC12,4 , TFS12 , YNL195C2,4 , YJR096W2 , OM452 GAC1, TAH181,2 ISF12 , YNR034W-A1,2 , MBR12 , MRK1, TCB21,3 , GAT22 , YLR460C1,2,3 , ATR11,2 OPI102,3 , GGA12 , RTC32,3 , YRO22,3 , APJ11 , LSB1, YGR127W, YGR250C1,2,3 , CUR11 , SAF1 RPN122,4 , RPT21,4 , PRE102,4 , RPT51,2,4 , RPN71,2,3,4 , RPN12,4 , RPN102,4 , PRE72,4 , YBR062C2,4 , PUP11,2,3,4 YER010C, COA33 , GET22 , BSD21,2 , YAH1, TDA5, MIM12 , YDR541C1 , MGR2, SLC12 TRX2, CRH11,2 , YOL019W1 , RAX21,3 , YOL014W2 , TGL22 , YPS32 , PHM82 , YBL111C, RIB4
Gcn4p (143) Sfp1p (128) Msn2p (61) Sok2p (60) Yap1p (51) Hsf1p (48) Rpn4p (44) Abf1p (33) Tec1p (31) Mga2p (6) Mot3p (6) Ino2p (5) Msn4p (5) Fkh2p (5) Flo8p (5) Rtg3p (4) Stp1p (4) Leu3p (4) Nrg1p (3)
PLB22,4 , TDA42,4 , YLR413W2,4 , ALE1, FAS11,2 , HEM13, SUR22,4 , NEM1, PEX31, FHN1 FIG24 , PRM12 , SAG14 , TIR12 , PAU242,4 , AAC32,3,4 , TIR42,4 , EUG14 , TIR22,3 , FIG12,4 HNM11,4 , SAH11,2,4 , FAS11,2,3,4 , SAM24 , CHO12,4 , EPT12,4 , ADO11,3,4 , OPI32,4 , EHT12,4 , YIP34 YHR022C1,2 , CRS52 , JLP1, SNO42 , FRE72 , PUG12 , YEL073C1,2 , PDC62,3 , GCY12 , FMP482,3 AIM201,2 , HOF11,2 , ALK11,2,3 , CLB11,2 , YMR030W-A, BUD41,2,3 , CDC52 ,YMR001C-A, KIN31 , HST31 FLO91 , TIR4, DED81, PAU7, MSC7, PAU5, ERG24, PAU24, NCP1, YGL108C IDH21,2,3,4 , PDH12,4 , AAT22 , CPR4, ARP3, WTM12 , IDP11,3,4 , URA8, ADE3, SLP1 VHR22,3,4 , MET17, MET32 , SUL1, MET102,3 , MET52,3 , MEP24 , PET8, GNP11,2,3,4 , DAL802,4 OAC12,3,4 , BAT12,3,4 , FRS22 , ILV52,3 , LEU12,3,4 , SSB1, MAE12,3 , PAB13 , YPL260W, ILV1 PDE2, PRM7, VTS1, YLR407W2 , GFD2, YGR079W, YNL095C, TRM5, PHO90, MCH51
28
S2
Supplemental Figures
29
Figure S1: Network inference is sensitive to RNA half-life and the number of condition clusters, and is improved when both are combined in the InfereCLaDR, in terms of both AUPR and AUROC. Related to Figures 1 and 3. 30
Figure S1: Network inference is sensitive to RNA half-life and the number of condition clusters, and is improved when both are combined in the InfereCLaDR, in terms of both AUPR and AUROC. Related to Figures 1 and 3. Precision-recall curves for A) S. cerevisiae, and B) B. subtilis. Each line corresponds to a different pre-set value of RNA half-life, and displays the median precision and recall across 20 Gold Standard (GS) re-samples. C) True positives vs. false positives curves, with each line corresponding to a different pre-set value of RNA half-life. Each line displays the median number of true positives and true negatives across 20 GS resamples. D) AUROC as a function of pre-set RNA half-life. Different lines denote 20 independent GS re-samplings, and colored dots represent the maximum AUROC for a given GS re-sample. E) Network inference performance (measured in AUPR) depends on the number of condition clusters the expression data is split into. The red line shows the AUPR of the confidence scores combined across the clusters, the black dash-dotted line shows the highest AUPR across the clusters, and the dotted blue line shows the lowest AUPR across the clusters. F) The improvement in the ROC curve is a result of the use of a high-quality gold standard (GS), bi-cluster specific network inference, and optimization of bi-cluster specific RNA half-lives. We compare InfereCLaDR (red line) to Inferelator without bi-clustering or half-life optimization (black dotted line), to the Inferelator using the MacIsaac gold standard of interactions (orange dashed line), and to CLR, Genie3 and iRafNet (purple dash-dotted line, blue dashed line and green dash-dotted line, respectively). Each curve is constructed using median precision and recall values across 20 re-samples. G) Terms enriched in the annotation data for each condition cluster (see Experimental Procedures). Larger font corresponds to higher enrichment. Terms with a p-value over 10−20 (after the Bonferroni correction) are excluded. H) A heatmap of median RNA half-lives that maximized network inference performance with respect to Area Under the Receiver Operating Characteristic curive (AUROC), for every bi-cluster (i.e. peaks of every curve in Figure S5). I) Correspondence between the predicted and measured RNA half-lives. Spearman correlation coefficient is shown in the top right corner. ”Translation” refers to the entire gene cluster enriched in translation, whereas ”translation genes” refers only to the genes in that cluster that are involved in protein translation (see Suppl. Experimental Procedures S3.7). A, C-I show analyses performed on yeast data.
31
Figure S2: Outline of the InfereCLaDR workflows used for testing and for creating the final network. Related to Experimental Procedures Section. We used two strategies for splitting the Gold Standard (GS) of interactions. Split A was used to optimize predictions of condition- and gene-specific RNA half-lives, which were then used to predict the final network. Split B was used during method development to evaluate the improvement in network inference accuracy conferred by bi-clustering and estimating RNA half-lives. In both cases, GS was randomly re-sampled into two or three equal sets of interactions. GS-train was used to estimate Transcription Factor Activity (TFA) and to run Bayesian Best Subset Regression (BBSR). GS-fit was used to find the optimal conditionand gene- specific prediction of RNA half-lives, based on maximum area under precision-recall curve (AUPR).
32
Figure S3: Combining multiple sources of evidence creates a consistent and high-quality collection of prior known regulatory interactions. Related to Experimental Procedures Section.
33
Figure S3: Combining multiple sources of evidence creates a consistent and high-quality collection of prior known regulatory interactions. Related to Experimental Procedures Section. Each of the collections of interactions listed in the legend was split into two equal parts, one for training the Inferelator (A-E) or CLR (F) and one for validation. Each line or bar represents statistics calculated on the 50% leave-out set of the corresponding collection of interactions. A-D: Black, gray, purple, and green lines correspond to the collections of interactions described in Table S1. The green line includes only the signed portion of the SGD collection (where positive and negative regulation is pre-specified). Blue line corresponds to the interactions that appear at least once in the Yeastract-Direct collection and at least twice in the Yeastract-Indirect collection, without positive and negative regulation specified. The red line shows the final signed Gold Standard we use throughout this study. A) Precision vs. recall for the six collections of interactions. B) Precision as a function of number of predictions. C) Receiver Operating Characteristic curve for each collection of interactions. D) Number of true positives as a function of the number of predictions. Note that in B and D, performance on the Gold Standard decreases more quickly after several hundred interactions because the positives exhaust all true interactions more quickly due to the smaller size of the Gold Standard as compared to other collections shown. E) The number above each bar indicates the number of interactions in the collection, after removing genes and TFs that do not appear in our RNA expression dataset. The color of each bar represents whether it is an existing dataset or a combination of these datasets, either based solely on knock-out evidence (orange), or on an intersection of knock-out and a ChIP binding evidence (red). The shaded pattern represents whether none, some, or all interactions are signed. Except for (the final) GS, signed interactions came from the signed database in the intersection (Kemmeren or SGD). For the partially signed collection (the rightmost bar), which is a superset of the GS, the same signs as in GS were used when available, and all other interactions were set to 1 (1GS the indicator function of GS, i.e. the GS with all of its interactions set to 1). Standard notation for set theory was used, with ∪ denoting union, ∩ denoting intersection, and \ denoting set difference. F) Precision vs. recall for the six collections of interactions, as generated from predictions made using CLR instead of Inferelator.
34
Figure S4: Network inference accuracy is sensitive to RNA half-lives in a condition-specific manner. Related to Figures 1D and 2. Each panel shows 20 AUPR plots as as a function of pre-set RNA half-life, when the InfereCLaDR is trained on the the genes specific to the gene cluster (shown on the right) and on conditions specific to the condition cluster (shown on top). Each of the 20 lines corresponds to the 20 Gold Standard re-samples, where 50% of the GS was used for training, and the remaining 50% for calculating AUPR. ”translation clust.” refers to the whole cluster enriched in translation genes, and ”translation” refers only to known translation genes, shown in Figures 1D and 2.
35
Figure S5: Network inference accuracy is sensitive to RNA half-lives in a condition-specific manner, in terms of area under the ROC curve (AUROC). Related to Figure 1D. Each panel shows 20 AUROC plots as as a function of pre-set RNA half-life, when the InfereCLaDR is trained on the the genes specific to the gene cluster (shown on the right) and on conditions specific to the condition cluster (shown on top). Each of the 20 lines corresponds to the 20 Gold Standard re-samples, where 50% of the GS was used for training, and the remaining 50% for calculating AUROC.
36
Figure S6: Improvements due to RNA half-live variation and incorporation are independent of the choice of various parameters and evaluation metric. Related to Figures 1 and 3.
37
Figure S6: Improvements due to RNA half-live variation and incorporation are independent of the choice of various parameters and evaluation metric. Related to Figures 1 and 3. A) and B) The precisionrecall curves and the AUPRs as a function of RNA half-life, respectively, for the regime in which 34% of the GS was used for training the Inferelator, and the rest as a leave-out for validation. C) and D) The same metrics but for a regime in which 67% of the GS was used for training, and the rest for validation. E) and F) The true positives vs. false positives curves and the AUROCs as a function of RNA half-life, respectively, for the regime in which 34% of the GS was used for training, and the rest for validation. G) and H) The same metrics as E and F, but with 67% of GS as the training set. I-K: Interactions predicted by the InfereCLaDR vs the Inferelator, in terms of rank (lower rank is higher confidence). The horizontal and vertical blue lines show the precision=0.5 cutoff for Inferelator and InfereCLaDR, respectively. The red line maps the InfereCLaDR rank to the same rank in Inferelator. Hexagonal bins correspond to the number of counts of I) Gold Standard interactions, J) interactions not in the Gold Standard, but present in an orthogonal interactions database, and K) interactions in neither the Gold Standard nor any of the orthogonal interactions databases.
38
S3
Supplemental Experimental Procedures
Abbreviations: AUROC: Area Under the Receiver Operator Characteristic curve AUPR: Area Under the Precision and Recall curve FP: False Positives GO: Gene Ontology GS: Gold Standard of interactions NCSM: Nucleotide-Containing Small Molecule (metabolism genes) PR: Precision-Recall SGD: Saccharomyces Genomes Database TF: Transcription Factor TFA: Transcription Factor Activity TP: True Positives TRN: Transcription Regulatory Network
S3.1
Inferelator Worfklow
The Inferelator is a multi-step framework that aims to predict a TRN from expression data (Bonneau et al., 2006; Greenfield et al., 2013; Arrieta-Ortiz et al., 2015). As inputs, it takes RNA expression data as well as orthogonal sources of data. This orthogonal data can originate from Pol II binding, DNA accessibility, knock-out, overexpression assays, or motif data, and is normally compiled into a matrix of prior known interactions to be used for training and validation of the resulting network. The main idea behind the Inferelator is that the rate of change of RNA levels of a gene i can be expressed as a difference between its transcription and degradation of its transcript, where the transcription term is approximated as a linear combination of the activities of the gene’s potential TFs. This section describes additional details of how we implemented the original Inferelator and modified its workflow to develop InfereCLaDR. The general approach is described in Experimental Procedures 4.4. This document expands on this description. S3.1.1
Estimation of Transcription Factor Activities
We calculate TF activities (TFA) of a given TF from the expression levels of its prior known targets in the respective conditions, as described in (Liao et al., 2003; Arrieta-Ortiz et al., 2015). The primary motivation for calculating TFA, rather than using a TF’s RNA expression levels, is that the expression level of a TF is a poor proxy for its activity, and the activity is much better estimated by examining the expression level changes of a TF’s known targets, as demonstrated in Arrieta-Ortiz et al. (2015). As described in reference Arrieta-Ortiz et al. (2015), we calculate TFA by expressing the response variable, as defined in Equation 2 (Experimental Procedures 4.4), as a linear combination of TFA levels of each of its known regulators. Mathematically, this relation is expressed as such: Resp = P A,
(3)
where P is the connectivity matrix, A is the full, desired TFA matrix (with rows corresponding to TFs and columns to samples), and Resp is the response matrix with each row corresponding to the response variable as defined in Equation 2. The elements of the connectivity matrix P come from the GS or its subset, such that a value of 0 corresponds to no known interaction, 1 to a known positive interaction (activation), and -1 to a known negative interaction (repression). We calculate the activity matrix A by multiplying both sides by the pseudoinverse of P (i.e. the Moore–Penrose inverse, using the R function pseudoinverse from the corpcor package).
39
S3.1.2
Model Selection and Prediction Confidence Score
Selecting the set of TFs that regulate a specific gene involves several steps and is the essence of TRN inference, because this is the process in which we select the most likely regulatory network for every gene in the genome. The first step involves narrowing down the list of all genes known to be acting as TFs in the organism to a small set of p TFs that are specific to a given gene. This step is completed using timelagged Context Likelihood of Relatedness (tlCLR), a method for calculating context-dependent mutual information between gene i and its potential TF (Greenfield et al., 2010; Madar et al., 2010). Typically when using the Inferelator, this value is set to p = 10, which is what we also use here. In addition to these p regulators, TFs known to regulate gene i from prior known interactions are appended to its set of potential regulators. Denote this set as Pi . Selecting the most accurate model of regulation for a gene i is thereby equivalent to selecting the set of regulators Pi ⊂ Pi that optimizes an objective function. The choice of the objective function is a version of the Bayesian Information Criterion (BIC), modified with a Zellner’s g-prior, which incorporates prior known interactions into the model by reducing the sparsity penalty for prior known regulatory interactions. This approach, known as Bayesian Best Subset Regression (BBSR), is described in more detail in Greenfield et al. (2013). Note that for this step, we linearly scaled and shifted every gene in the response matrix and every TF in the design (TFA) matrix such that every row has mean 0 and variance 1 across samples, which was necessary to avoid having to estimate an extra y-intercept parameter corresponding to basal transcription rate. Once the model Pi that minimizes the BIC is selected, we employ a computational knock-out approach to calculate the confidence we have in each predicted regulatory interaction. The confidence score of the predicted interaction between gene i and TF j ∈ Pi is determined in the following manner. We first determine the parameters βi,k for all k ∈ Pi by performing linear regression on Equation 2, and use them to calculate the right-hand side of Equation 2. We denote this entity as the predicted profile. Accordingly, 2 be the variance of residuals the left-hand side of Equation 2 is denoted as the observed profile. Let σi,j between the predicted profile and the observed profile. Let Pi¬j be the same model of regulation of gene 2 be the variance of residuals between the observed profile and the i, but without TF j. Then let σi,¬j predicted profile as calculated using Pi¬j . Then the confidence score of the interaction between TF j and gene i is defined as confi,j = 1 −
2 σi,j 2 σi,¬j
(4)
To avoid overfitting and derive empirical confidence intervals for model parameters, tlCLR and BBSR are repeated on different but possibly overlapping subsets of the response matrix, denoted as bootstraps of the expression data. Every such bootstrap is a column-wise (i.e. sample-wise) subset of the response matrix. The columns are selected randomly using the R function sample with default settings. Using this command, we sample n columns with replacement, where n is the total number of columns in the full response matrix. For each bootstrap of the response matrix, the confidence score of every possible interaction is determined using Equation 4. The final combined confidence score for every interaction is determined by adding the ranks of the confidence scores of that interaction across bootstraps. Note that a high score corresponds to a low rank, i.e. a high-confidence prediction. We use 50 bootstraps for all analyses in this paper, except when specified otherwise. S3.1.3
Estimating Network Prediction Accuracy
We define validation set as the set of known true interactions that we validate our prediction against. Because the output of the Inferelator (and consequently, the InfereCLaDR) is a list of predicted interactions, ranked by their combined confidence scores from highest to lowest, we may define precision and recall as functions of i in the following way: precision is the fraction of the interactions in the top i lowest-ranked
40
(highest-scoring, best) predicted interactions that are also in the validation set, whereas recall is the number of interactions in the top i lowest-ranked predicted interactions divided by the total number of interactions in the validation set. We calculate precision and recall for every value of i in the full list of predicted interactions. Every PR curve is plotted by connecting the precision and recall values for consecutive values of i with straight lines, and the reported Area Under Precision-Recall curve (AUPR) is the area under this curve. True Positives (TPs) is a function of i that maps i to the number of interactions among the top i highest-ranked predicted interactions that are also in the validation set, whereas False Positives (FPs) is the number of interactions among the top i highest-ranked predicted interactions that are not in the validation set. We plot all TP-FP curves by connecting the TP and FP values for consecutive values of i with straight lines. We report Area Under the ROC curve (AUROC) by dividing the area under the TP-FP curve by the product of the total number of TPs and the total number of FPs. For calculating precision and recall, as well as TPs and FPs, on a leave-out set (i.e. when a certain leave-in set of interactions was used for training the network prediction algorithm, i.e. the TFA step and the BBSR step), we use the leave-out set as the validation set for calculations described above. The confidence scores of predicted interactions that appear in the leave-in set are set to 0, effectively moving them to the end of the ranked list of predicted interactions. This removes the effect of penalizing the leavein interactions that are correctly predicted by the algorithm when only using the leave-out for evaluation. The sign of the interactions is not taken into account in any of the AUPR and AUROC calculations. Predictions involving TFs and genes with no prior known interactions in the leave-out set are excluded from AUPR and AUROC calculations.
S3.2
Primary Data Processing
The meta data for every whole-genome expression sample (GSM) was downloaded using the R function getGEO from the GEOquery package (Edgar et al., 2002; Barrett et al., 2013). These GSMs were filtered such that the final list only contained samples measured in Saccharomyces cerevisiae using the Yeast Affymetrix 2.0 platform (GPL2529). For each expression series (GSE) that contained at least one of those GSMs, a TAR file was downloaded from the GEO website on March 23, 2015, using the R function download.file. The raw CEL files for every sample in that GSE (filtering for S. cerevisiae and GPL2529) were furthermore extracted from the associated TAR file. For processing and normalizing the raw CEL files, we used the R packages affy (Gautier et al., 2004) and gcrma (Wu et al., 2004), using the functions ReadAffy and gcrma. All samples were processed simultaneously as one batch. For time series meta data, all time differences were converted into minutes. To test for batch effects, we used ComBat (Johnson et al., 2007; Leek et al., 2012), treating either laboratory of origin or experiment series as batches, but did not observe an improvement in network prediction (data not shown). All computations were made in R programming language, version 3.1.2 (R Core Team, 2014).
S3.3
Condition Clustering
Principal Component Analysis (PCA) of the entire expression matrix was performed using the R command prcomp, treating each expression sample as a 5, 716-dimensional vector. K-means clustering of the dimensionally reduced (post-PCA) expression samples was performed using the R function kmeans, using the default parameters with nstart=25, and iter.max=1000, corresponding to 25 initial random configurations (among which the best one is automatically picked by the built-in algorithm in R) and a maximum of 1,000 iterations. PCA and k-means clustering was performed on scaled expression data (mean shifted to 0 and variance scaled to 1), and all subsequent analysis was done on the resulting clusters using the original (processed and normalized, but not scaled) expression data. We also performed some of the downstream analyses using clusters that were obtained the same way but without scaling, which did not lead to any significant differences in terms of inference performance (not shown).
41
The number of condition clusters was determined by rank-combining the confidence score outputs of InfereCLaDR (pre-half-life-fitting) across a pre-specified number of clusters, and maximizing the prediction accuracy of this procedure as a function of the number of condition clusters. Rank-combining across the predictions obtained from a set of disjoint expression clusters was done by taking the sum of the confidence score matrices (combinedconf*.RData files). We used the AUPR of the rank-combined prediction as a proxy for prediction accuracy. The pre-specified numbers of clusters that we tested were n = 2-10, 12, 14, 16, 18, 22, 26, 30, 40, 50, 60, 70, and 80. For a given n, the entire expression data set was first clustered into n clusters (as described in the paragraph above). For each n, we ran the InfereCLaDR separately for each of n expression clusters, using 50% of the GS as the leave-in set for TFA and model selection steps, and the rest as the leave-out set for calculating AUPR. We did not resample the GS for this analysis, we used 10 response matrix bootstraps, and set τ = 60 for this analysis. The same randomly sampled leave-in set was used across the InfereCLaDR runs for all clusters. Figure S1E shows AUPR as a function of the number of clusters n. Furthermore, it shows the AUPR of the worst-performing and the best-performing cluster among the n clusters. We notice that the AUPR is maximized around n = 4 and n = 7, where it is approximately equal to 0.35. We selected n = 4 as our optimal number of clusters because after n = 4, there is a sharp drop in the AUPR of the worst-performing cluster, rendering it uninformative. In addition, 4 clusters is optimal as it also minimizes overfitting in our modeling, leaving each cluster with a large number of diverse data points. Given that for every gene, we test approximately 500 potential TFs (jointly during the tlCLR step and the Bayesian regression step), we aim for at least 500 conditions in each cluster. However, clustering with n > 4 resulted in some clusters with much fewer than 500 conditions. This reasoning further solidifies our choice of the number of clusters. All downstream analyses were performed on these four clusters. To annotate the condition clusters, all text that was provided in every available sample of the meta data table was sorted into four bins according to the sample’s assigned cluster. We processed the list of words in every cluster as follows using the R package tm (Feinerer and Hornik, 2015; Feinerer et al., 2008). First, common English words, commas, and numbers were removed. All text was converted to lower case. Unnecessary white spaces were removed, and term frequency statistics were calculated using the R package SnowballC (Bouchet-Valat, 2014). Multiple hypothesis testing for enriched terms in each cluster was performed using the Bonferroni correction. To foster visualization and interpretability, we removed spurious terms, which we did not consider biologically relevant, from these lists. The final word clouds were made using the R function wordcloud from the package wordcloud (Fellows, 2012). To assign each cluster with a final label and to ensure that each topic is not enriched in its cluster as a result of lab bias in term usage, we compared the occurrence of enriched terms in each cluster and assessed whether they come only from one lab or from multiple labs (Supplemental File SuppData2.zip). See Supplemental Notes Section SN1 for a detailed discussion of primary and secondary enrichments in each condition cluster, and the justification for the final condition cluster naming scheme. Our condition cluster label assignments were also supported by GO enrichments in genes that had the highest relative variance compared to other genes in their cluster (not shown).
S3.4
Gene Clustering
Gene clustering was performed as described in Experimental Procedures 4.2. The reason for scaling expression data before clustering is that this approach results in more evenly-sized gene clusters. Since our method for determining gene- and condition-cluster specific RNA half-lives relies on the availability of prior known interactions with the target genes, we first performed gene clustering for the 997 genes that were present in our Gold Standard of interactions. We performed hierarchical clustering using the R function hclust with default arguments (e.g. euclidean distance metric). We determined the number of clusters by comparing several qualities of the resulting gene clusters as a function of the number of clusters. Given a pre-specified number of clusters k, we used the R function cutree to cut the similarity tree into k clusters. For each of the k resulting gene clusters, we performed Gene Ontology (GO) enrichment analysis (see Suppl. Experimental Procedures S3.5). After performing this analysis for k ranging from 2 to 8, we 42
determined that setting k > 5 does not result in any new clusters with meaningful new GO enrichments as compared to the enrichments obtained with k = 5 clusters. Additionally, having more than 5 clusters and small cluster sizes was unfavorable because the RNA half-life inference for a given gene cluster relies on AUPR maximization calculated on a leave-out list of prior known interactions involving the genes in that cluster. AUPR calculation gets increasingly noisy when the number of known true positives is small. For k = 5, the smallest number of prior known interactions with the genes in one of the five clusters was 101, and the largest was 592. This range was sufficient for producing relatively smooth half-life vs. AUPR curves (Figure S4). Picking a larger k resulted in the breaking down of the smallest cluster into smaller clusters. Therefore, we used 5 gene clusters for the analysis. Importantly, for the final yeast network, we included regulatory network predictions for all 5,716 genes in our expression data set. To this end, we assigned each gene with one of the clusters that was determined using the 997 genes from the GS (as described above). We denote the set of those 997 genes as gGS. For a given gene g, its cluster membership was assigned by finding the gene gprior ∈ gGS that minimizes the Euclidean distance between g and gprior , and assigning the cluster membership of gene gprior to gene g. This approach was superior to clustering all 5,716 genes, because the latter approach resulted in highly unevenly distributed cluster sizes within the 997 genes in gGS, and therefore very noisy AUPR calculations.
S3.5
Function Enrichment Analysis
For determining Gene Ontology (GO) enrichments, we employed the hypergeometric test using the hyperGTest function from the R Bioconductor package ‘GOstats’ and the R Bioconductor package ‘org.Sc.sgd.db’ (Carlson et al., 2014). We used the 997 genes that have at least one interaction in the GS as the background set. We corrected p-values for multiple hypothesis testing using the Bonferroni correction. For each gene cluster, we reported the top GO terms, all of which were statistically significant (p < 0.05). For the full list of gene ontology terms enriched in each cluster, see Supplemental File SuppData1.zip.
S3.6
Gold Standard Curation
The new, high-quality Gold Standard was created as described in Experimental Procedures 4.3 in the main text. Here we specify some further details. For example, in the YEASTRACT dataset, in case the sign of an interaction was specified, it appeared in the ”Association Type” column. When ”Association Type” had a label ”Negative”, it denoted a decrease in abundance after a TF knock-out (and a positive sign according to our convention), and ”Positive” denoted an increase (resulting in a negative sign according to our convention). To obtain interactions from Saccharomyces Genome Database, we downloaded 32,338 unique regulatory interactions using Yeastmine (Balakrishnan et al., 2012; Costanzo et al., 2014). We refer to these 32,338 interactions as the SGD standard (such as in Table S3). The SGD website provided signs for 10,022 of those interactions. We observed 363 agreements and 12 disagreements in signs of the overlapping signed interactions between the list of 1,155 high-confidence signed interactions obtained from YEASTRACT (as described in Experimental Procedures 4.3) and the 10,022 interactions from SGD. We used the signed portion of the SGD standard to expand our list of signed GS interactions in the following way. If the signed interaction from SGD was already in the list of 1,155 from YEASTRACT, we kept the sign that we obtained from YEASTRACT regardless of its sign in the SGD standard. If it was among the 2,577 high-confidence interactions obtained from YEASTRACT, but without a sign from YEASTRACT, we assigned it with the sign from SGD. This procedure provided an additional 117 signed interactions towards the final Gold Standard of interactions. Furthermore, we used the data from Kemmeren et al. (2014), which contained the results of a study that measured genome-wide expression level changes for 1,484 knock-outs, including most TFs in yeast. First, we created a collection of interactions by considering all knock-out-induced expression changes that were reported in the original publication with a p-value of p < 0.01 after the Bonferroni correction, yielding 43
> 193, 000 interactions (which we furthermore denote as the Kemmeren standard). Of these, 131 were present amongst the 2,577 high-confidence interactions described above. Adding these 131 interactions expanded our high-confidence set of interactions to 1,403. To identify the best set of possible prior known interactions in a consistent manner, we used a procedure estimating how effectively the Inferelator trained on one half of a collection of interactions can recover the other half of the same collection of interactions. Figure S3 compares the performance of the Inferelator on the 50% leave-out set of seven different collections of prior known interactions, when trained on the 50% leave-in set of the same seven collections (respectively). According to both the precision-recall curve and the Receiver Operating Characteristic curve, the final Gold Standard of 1,403 signed interactions is by far the most consistent, i.e. it has the highest area under the curve as compared to any of the sources of interactions shown in Table S1. The biggest improvement in AUPR is achieved by restricting the collection of interactions from YEASTRACT to the 2,577 unsigned interactions that have 1 piece of evidence in Yeastract-Direct and 2 pieces of evidence in Yeastract-Indirect (blue line, AUPR=0.20, AUROC=0.70). Reducing those interactions to only those for which we have evidence of the direction of the regulation in either of the three sources of interactions (YEASTRACT, SGD, and Kemmeren) further improves performance (Gold Standard, red line, AUPR=0.30, AUROC=0.73). Our Gold Standard also outperformed the MacIsaac regulatory interactions database (MacIsaac et al., 2006), which is commonly employed for evaluating network inference algorithms in yeast (Marbach et al., 2012; Siahpirani and Roy, 2016). We extended this comparison to other possible combinations of these collections of interactions (Figure S3E). We confirmed our observations using CLR (Context Likelihood of Relatedness, see below) instead of the Inferelator, confirming that combining one direct and two indirect sources of evidence results in the most self-consistent collection of interactions (Figure S3F). Our results led us to conclude that the 1,403 interactions of our Gold Standard constitute the most consistent and reliable collection of interactions in yeast known to date.
S3.7
Gene Cluster Half-Life Prediction and Validation
The gene cluster that was most prominently enriched in cytoplasmic translation was also the largest cluster (409 genes), and also contained many genes unrelated to translation. Therefore, for predicting RNA half-lives for cytoplasmic translation genes, we used only the genes that were annotated as ”cytoplasmic translation” genes in the Saccharomyces Genome Database (GO:0002181). This consisted of 171 unique gene names, 127 of which were ribosomal protein coding genes (either RPLs or RPSs). These genes were used to calculate the predicted translation gene RNA half-lives in Figure 2 (see Experimental Procedures 4.6). The predicted RNA half-life values for nucleobase containing small molecule (NCSM) metabolism genes were obtained using the genes in the ”nucleic acid metabolism” gene cluster in ”log-phase growth” and ”chemostat” condition clusters. We used the AmiGO 2 website (Carbon et al., 2009) to obtain 207 genes contained as members of the nucleobase containing small molecule metabolic process category (GO:0055086). These genes have experimentally measured RNA half-lives reported in Figure 2 (Neymotin et al., 2014). The list of ”cytoplasmic translation” and ”NCSM metabolism” genes here can be found in Supplemental File SuppData1.zip.
S3.8
Combining Condition-Specific Networks to Create the Final Yeast Network
The final network of regulatory interactions predicted by InfereCLaDR was created in several steps (Figure S2). First, bi-cluster specific RNA half-lives were determined by re-sampling the Gold standard into GStrain and GS-fit 20 times (i.e. employing the Split A approach). The predicted half-life for each bi-cluster was determined by taking the median across the half-lives that optimized AUPR in each of the 20 GS re-samples. Then the TFA and BBSR were trained on each bi-cluster using the full Gold Standard and the corresponding, predicted RNA half-life. The final combined confidence scores were obtained by adding the confidence scores across the condition clusters for every gene. To convert the final (combined) ranked 44
list of interactions into a set of predicted interactions, we set a rank cutoff based on the rank at which precision=0.5 for TFs and their targets in the GS. We applied this rank cutoff of 2,359 to all predictions (regardless of whether the TF or target gene had data in the GS), to obtain the final InfereCLaDR network (Figure 4). The procedure resulted in 1,462 new interactions that were not in the GS. These new interactions are listed in SuppNetwork1.tsv in Supplemental File SuppData1.zip. SuppNetwork2.tsv in Supplemental File SuppData1.zip contains all predictions and their precision values. We determined the rank cutoff for Genie3 network in the same way but using precision=0.15, as Genie3 predictions did not reach a precision over 0.20. We used same procedure to determine rank-based cutoffs for each condition cluster as we did for the full InfereCLaDR network, as well as for the final network predicted by the original Inferelator (yielding 1,922 interactions). To assign each interaction to a condition-cluster in the final InfereCLaDR network (Figure 4), we used the cluster with the lowest predicted rank determined. For each TF, we estimated the skewness of the distribution of the TF’s targets across the four condition clusters using the binomial test. Specifically, we compared the number of targets of each TF in a given cluster with the total number of its targets across all clusters, assuming using the ratio of all interactions in that cluster to all interactions across clusters as the null hypothesis. For each cluster, the five TFs with the highest skewness towards that cluster are shown in Figure 4 in large font. We also used the binomial test to calculate the enrichment of ”gained” interactions in each bi-cluster, treating the distribution of the sum of ”conserved” and ”removed” interactions as the null hypothesis (Figure 3F). The gene cluster of each interaction was assigned from the cluster of the target gene. The networks were visualized using the R packages igraph (Csardi and Nepusz, 2006) and network (Butts, 2008, 2015).
S3.9
Motif Enrichment Analysis
To validate the presence of the PACE-core motif 5’-(A/G)GTGGC-3’ (Shirozu et al., 2015) in CCT2, CCT3, CCT4, and CCT8 promoter regions, we used the online tool on the Yeastract website (http: //www.yeastract.com/searchbydnamotif.php) (Teixeira et al., 2006; Abdulrehman et al., 2011; Monteiro et al., 2008; Teixeira et al., 2014). We found that CCT2 and CCT4 each contain two exact matches for 5’-GGTGGC-3’ in the main and the complementary strand of their promoter regions, respectively. Furthermore, CCT8 and CCT3 each contained an exact match for 5’-AGTGGC-3’ in the main and the complementary strand of their promoter regions, respectively. Neither gene contained an exact match with the longer PACE sequence (Mannhaupt et al., 1999).
S3.10
Methods Comparisons
For implementing Genie3 (Huynh-Thu et al., 2010), we used the R code downloaded from the Genie3 website (http://www.montefiore.ulg.ac.be/∼huynh-thu/software.html) on September 22, 2016. We modified the original R code to allow for parallel processing, such that calculations on different genes could be performed on separate processors simultaneously. The implementation of the code was conducted as instructed in the README file downloaded from the same website. The R code for iRafNet (Petralia et al., 2015) was downloaded from the iRafNet website (http: //research.mssm.edu/tulab/software/irafnet.html) on January 27, 2016. We implemented it using the parameter specifications recommended in the original iRafNet paper (Petralia et al., 2015). In particular, we built the iRafNet weight matrix W using the leave-in portion of the Gold Standard by setting all values of 0 in our leave-in to 0.1 in the weight matrix, and all values of -1 and 1 to 1.1. We set the number of trees � ntree to 1,000, and number of potential regulators to be sampled from every node to mtry=round( ng − 1), where ng is the number of genes, and round is the rounding function. We implemented the Context Likelihood of Relatedness (CLR) method (Faith et al., 2007) by incorporating it into an Inferelator-like procedure analogous to Greenfield et al. (2010) and Madar et al. (2010). CLR scores for every potential interaction were calculated between the response of the target (left hand side of Equation 2 in Experimental Procedures 4.4) and the activity of the potential regulator (calculated as in Suppl. Experimental Procedures S3.1.1), using τi = 20. The BBSR and confidence score calculation steps 45
were entirely bypassed, and so the CLR score was used as the final confidence score for every interaction. Bootstrapping was performed as described in Suppl. Experimental Procedures S3.1.2. InfereCLaDR uses the original Inferelator framework, a new, high-quality GS, expression bi-clustering, and RNA half-life fitting (main text, Figure S2). Figure 3 (main text), Tables 1 (main text) and S2 show increases in performance based use of individual advances. For the ”Inferelator + GS” and ”Inferelator +MacIsaac” in Figure 3 (main text), Table 1 (main text) and Table S2, we set τi = 20 for all genes i and all conditions (see Equation 2 for a definition of τi ), which roughly corresponds to an assumed half-life of 14 minutes. This number was based on the median experimentally measured RNA half-life from various recent studies (Neymotin et al., 2014; Munchel et al., 2011; Miller et al., 2011). For the ”Inferelator + Clustering” (no half-life fitting) approach, we also used τi = 20 for all genes i and all four condition clusters. For the ”Inferelator + RNA half-life” (but no clustering) approach, we used the optimal value of RNA half-life as determined by maximizing the AUPR on the GS-fit collection of leave-out interactions from Split B approach (Figure S2), using the entire RNA expression data set as the input, rather than doing this analysis separately for each condition cluster. For each of these methods, training was done on the same set of 20 re-samples of the Gold Standard, as described in Experimental Procedures 4.5 and 4.6, using the Split B approach, and the AUPRs and AUROCs were calculated using the GS-validate set using the same 20 re-sampes (Figure S2). For each re-sample, individual comparisons in terms of AUPRs and AUROCs between different methods were made, and the number of times, out of 20, that the AUPR (or AUROC) of one method was higher than the other is shown in Table 1 (in terms of AUPR) and Table S2 (in terms of AUROC). Note that the calculation of RNA half-lives differs between the Split A and Split B approaches. The Split A approach reports a single value for each bi-cluster, which is obtained by taking the median of optimal half-lives across the 20 re-samples, where the optimal half-life is the one that maximizes the AUPR calculated on the GS-fit that corresponds to the given re-sample. Taking the median predicted half-life reduces half-life prediction error due to noise and re-sample specific effects, i.e. biases that arise from the particular interactions included in a given re-sampling of the Gold Standard. In the Split B approach, each RNA half-life prediction is derived from only one GS re-sample, because we prioritized guaranteeing the statistical significance of comparisons between methods in terms of network inference accuracy over RNA half-life prediction accuracy. That is, comparisons between methods were made for each GS re-sample separately, which resulted in 20 points of comparison for each pair of methods. Thus, Split B approach simultaneously inferred RNA half-life for each bi-cluster by maximizing AUPR over GS-fit produced from only one re-sample of the Gold Standard. Therefore, the magnitude of the increase in inference accuracy due to half-life fitting that we calculate using the Split B approach (Table 1) is an underestimate of the actual increase in accuracy of our final predicted network, which is produced using the Split A approach. We calculated the p-values in Figure 3F using the binomial test. The number of interactions predicted by the InfereCLaDR but not Inferelator (”gained”, top left quadrant in Figure 3C) in each bi-cluster (representing k, the ”number of successes”) was compared to the number of interactions falling into all three quadrants (”gained”, ”conserved”, and ”removed”) in the same bi-cluster (representing n, the ”number 1,076 of trials”). We assumed a null probability of 1,076+1,283+639 , or the total number of interactions in the ”gained” quadrant divided by the total number of interactions in all quadrants, independent of the bicluster. The sign by which the p-value was multiplied denoted whether the enrichment was on the upper tail (positive sign, i.e. higher than expected) or the lower tail (negative sign, lower than expected). All of the InfereCLaDR, Inferelator, iRafNet, Genie3, and CLR runs that involved the 20x re-sampling of the Gold Standard were performed on the NYU High Performance Cluster (https://wikis.nyu.edu/ display/NYUHPC), using the PBS job scheduling system (Henderson, 1995).
46
SN1
Supplemental Notes: Cluster Interpretation
This section discusses the clusters shown in Figure S1G which lists the most enriched terms of each of the condition clusters using the associated meta-data. Briefly, condition cluster 1 is from cells grown in chemostats, with experiments on nutrient limitation and long time series. Condition cluster 2 comprises experiments that examine oxidative stress or inhibit transcription, which usually relate to early measurements of RNA degradation rates (Shalem et al., 2008, 2011), with treatments such as methyl methane sulfonate (MMS) and phenanthroline. Cluster 3 is enriched in comparisons of mutant strains to wildtype, with cells grown in rich medium. Finally, condition cluster 4 includes fermentation involving industrial strains. The time series experiments in this cluster often last several days. The analysis of the enrichments shown in word clouds was only the first step in determining the most aptly fitting, biology-motivated name for each cluster. After the enrichment analysis, we examined each enriched term individually and counted its appearance in the cluster in which it was enriched compared to other clusters. We also counted from how many separate data series and laboratories the experiments whose meta data contained the term originated. The results of this analysis are summarized in the table in top-words_by-clust.pdf in Supplemental File SuppData2.zip. Specifically, while cluster 1 was more enriched in the terms “oscillation” and “cenpkd” (as shown in Figure S1G), the 195 samples that contained each of these terms came from only 6 and 11 series (from 3 and 6 labs), respectively. The next three most significant terms (aside from “period”, which is arguably redundant with “oscillation”) - “fermentor”, “continuous” and “chemostat” - were all related to cells growing in continuous culture, i.e. a chemostat or biofermentor. “Chemostat” is in the meta data of 187 samples coming from 14 series and 11 different labs. These numbers suggests that the enrichment in chemostat-grown cultures might be more biologically meaningful than the enrichment in the “oscillation” conditions, because it is supported by 3-4 times as many independent laboratories, and twice as many as for “cen.pk”. Given these considerations, the cluster was named “chemostat”. In cluster 2, “decay” was by far more enriched than any other term. The experiments that contained this term came from studies in which decay rates were measured by suppressing and inhibiting transcription, e.g. Shalem et al. (2008) (GSE12221) and Shalem et al. (2011) (GSE26829). Inhibiting transcription to measure RNA degradation is now recognized as a highly invasive method, delivering results very different from non-invasive, labeling-based methods that measure RNA degradation (Neymotin et al., 2014). To mark the fact that the cluster contained invasive methods for RNA degradation measurements, we name it the “transcription inhibition” cluster. The term “oxidative” was also enriched in this cluster, but to a much lower degree, and the cluster contained other smaller studies with unrelated stresses that nevertheless induced similar RNA expression levels. The most enriched terms in cluster 3 were “wild type” and “midlog”, referring to the growth of wild type yeast cells in mid-log phase growth conditions. Upon further inspection, we determined that around half of the experiments in this condition cluster involved cells grown in mid-log phase or in rich medium, YPD, or YEPD. Furthermore, approximately half of these cells were wild type, and subsequent inspection revealed that a lot of the studies compared mutants with wild-type cells in rich conditions. Because our naming convention is condition-focused rather than genotype-focused, we chose “log-phase growth” as the name of this cluster. Finally, for cluster 4, the top two enriched terms, “wort” and “fermentation”, are related to use of yeast for industrial-scale, fermentation-based brewing. “Day” was enriched only because a lot of these studies comprised of time-series with time differences between consecutive time points measured in days, as is typical for brewing. Fermentation/brewing related terms such as “lager”, “riesling”, “wine”, “malt”, and others, were also enriched, coming in total from 7 series and 5 different labs. Given the dominance of this theme in cluster 4, it was named the ”fermentation” cluster. Next, we clustered the data by gene expression patterns and tested for enriched gene functions (Experimental Procedures 4.2). The most significant enrichments in the five gene clusters included cytoplasmic translation, cell wall biogenesis, nucleic acid metabolism, and protein catabolism, respectively (p-value
47
< 0.05, Figure 1C). We used these categories to name the clusters. Large clusters contained additional enrichments. For example, in the ”translation” cluster, translation genes only occupy a fraction of the cluster, and the rest of the cluster is enriched in protein and ncRNA biosynthesis (p-value < 10−22 and p-value < 10−12 , respectively). See Supplemental File SuppData1.zip for the full list of tested GO terms and their corresponding enrichments for every gene cluster.
SN2
Supplemental Notes: Condition Cluster Half-Lives
Figures 2A and E shows that the optimal RNA half-life, as estimated by maximal AUPR, differed between condition clusters. Further, the shape of the distribution varies across condition clusters (Figure S4), suggesting different degrees of half-life dependence under different conditions. The unperturbed condition clusters have comparatively low RNA-half-lives of approximately 10 to 25 minutes. This prediction is consistent with experimental half-life estimates of 10 to 15 minutes for unperturbed conditions (Miller et al., 2011; Neymotin et al., 2014; Munchel et al., 2011), which we expect to be similar to ”chemostat” and ”log-phase growth” conditions. In comparison, the half-lives in the “transcription inhibition” cluster in Figure 2D cluster are longer than those under unperturbed conditions (Wilcoxon p < 10−10 ). The “transcription inhibition” cluster contains many studies that measured RNA decay rates by inhibiting transcription, e.g. Shalem et al. (2008). Indeed, it has recently been shown that these studies identify significantly longer RNA half-lives than those using less invasive metabolic labeling methods (Neymotin et al., 2014; Pelechano and P´erez-Ort´ın, 2008) (Wilcoxon p < 2 × 10−16 ).
SN3
Supplemental Notes: New Predictions
SN3.1
CCT Targets of Rpn4p
The four interactions (i.e. Rpn4p regulating CCT2, CCT3, CCT4, and CCT8 ) were absent from the Kemmeren database; only CCT4 was present in Yeastract-Direct database as directly bound by Rpn4p; only CCT3 and CCT8 were in the Yeastract-Indirect database; and only CCT4 was in the SGD database with computational support (see SuppNetwork2.tsv in Supplemental File SuppData1.zip). Several reasons explain why these interactions have so far gone undetected, but are strongly predicted by the InfereCLaDR. For example, the four mRNAs have optimal half-lives between 42 and 83 min, far longer than the median half-life of 10-20 minutes used in the original Inferelator. Further, these interactions are very condition-specific, that is, they are restricted to only one cluster, explaining why they are missed in global inference.
SN3.2
New Predictions With Indirect Validation
Another form of validation arises from gene knockout phenotypes. For example, the Tec1p transcription factor regulates filamentation genes, but also positively affects lifespan (M¨osch and Fink, 1997; Garay et al., 2014). Its predicted target TRX2 (predicted to be repressed) is a thioredoxin isoenzyme involved in the oxidative stress response (Garrido and Grant, 2002; Greetham et al., 2010). A screen by Postma et al. (2009) demonstrated increased lifespan in a TRX2 null mutant strain Postma et al. (2009), supporting indirectly our predicted repression of TRX2 by Tec1p. A second example relates to MRK1, which InfereCLaDR predicts as a new repression target of Yap1p, which is required for tolerance to oxidative stress and cadmium exposure (Kuge and Jones, 1994; Wemmie et al., 1994; Lee et al., 1999). Mrk1p itself activates stress response genes (Hardy et al., 1995; Hirata et al., 2003). Further, cellular cadmium levels increase in the MRK1 knockout (Yu et al., 2012), again supporting a repression link between MRK1 and Yap1p and response to cadmium.
48
SN3.3
Condition-Specific Predictions
Next, we explored interactions that were predicted due to InfereCLaDR’s condition specificity. Figure 4 shows the list of top five most condition-responsive TFs in each cluster. For many of these TFs, condition specificity is in line with their cellular function. For example, the ”chemostat” cluster is enriched in experiments related to glucose limitation (Figure S1G, Supplemental Data SupData2.zip). Both Cat8p and Adr1p are InfereCLaDR predicted regulators for which literature already describes a link to glucoselimiting conditions (Turcotte et al., 2010), but screens have missed them so far. The Cat8p TF is known to bind carbon source responsive elements in absence of glucose, and Adr1p is related to glucose-limiting conditions. Sok2p regulates growth under nitrogen limitation (Ward et al., 1995) and is a new, InfereCLaDR predicted regulator specific to the chemostat cluster (Figure 4). Indeed, the chemostat cluster contains all nitrogen-limitation experiments we used (Figure S1G, Supplemental Data SupData2.zip). Finally, Spt10p is an S-phase specific transcription factor (Xu et al., 2005), and we predict most of its targets within the chemostat cluster. Consistent with this prediction, the cluster contains 195 out of 210 samples related to cell cycle (Supplemental Data SupData2.zip). In sum, InfereCLaDR recognized the condition-specific activities of Cat8p, Adr1p, Sok2p, and Spt10p that previous approaches failed to predict. Another example arises from Mtf1p and Mhr1p, which were predicted as highly specific to the ”logphase growth” condition cluster. Mitochondrial Transcription Factor 1 (Mtf1p) facilitates high levels of transcription by bending and unbending their promoter and non-promoter regions, respectively (Lisowsky and Michaelis, 1989; Tang et al., 2011; Velazquez et al., 2012). While the mechanism of Mtf1p activity as a polymerase specificity factor is well-understood, its pattern of activity and targets are not. We predict that Mtf1p transcriptionally activates 24 mitochondrial genes (20 of which are ”gained” by InfereCLaDR). Of these interactions, 19 are among the most significant predictions in the ”log-phase growth” cluster, i.e. its interactions are highly condition-specific (Figure 4). The example of Mhr1p illustrates the opposite case, a regulator with many ”conserved” interactions for which InfereCLaDR predicts targets across diverse conditions. Note that similarly to Mtf1p, Mhr1p is predicted to regulate a large proportion of mitochondrial genes. Mhr1p is known for its role in different contexts, ranging from mitochondrial DNA replication and homologous recombination (Ling and Shibata, 2002; Hori et al., 2009) to nuclear transcriptional repression (Emili et al., 1998; Traven et al., 2002). Our results show that Mhr1p targets are distributed across different condition clusters (Figure 4), and that about half of them are also predicted by the original Inferelator, whereas almost none of Mtf1p’s targets were also predicted by the Inferelator. Taken together, InfereCLaDR predicts Mtf1p and Mhr1p to regulate mitochondrial genes in a dynamic, combinatorial way, with Mtf1p acting under few, specific and Mhr1p under multiple conditions.
Supplemental References Abdulrehman D., Monteiro P.T., Teixeira M.C., Mira N.P., Louren¸co A.B., dos Santos S.C., Cabrito T.R., Francisco A.P., Madeira S.C., Aires R.S., Oliveira A.L., S´a-Correia I., and Freitas A.T. (2011). YEASTRACT: providing a programmatic access to curated transcriptional regulatory associations in Saccharomyces cerevisiae through a web services interface. Nucleic acids research 39, D136–40. Arrieta-Ortiz M.L., Hafemeister C., Bate A.R., Chu T., Greenfield A., Shuster B., Barry S.N., Gallitto M., Liu B., Kacmarczyk T., Santoriello F., Chen J., Rodrigues C.D.A., Sato T., Rudner D.Z., Driks A., Bonneau R., and Eichenberger P. (2015). An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network. Molecular systems biology 11, 839. Balakrishnan R., Park J., Karra K., Hitz B.C., Binkley G., Hong E.L., Sullivan J., Micklem G., and Cherry J.M. (2012). Yeastmine—an integrated data warehouse for saccharomyces cerevisiae data as a multipurpose tool-kit. Database 2012, bar062. 49
Barrett T., Wilhite S.E., Ledoux P., Evangelista C., Kim I.F., Tomashevsky M., Marshall K.A., Phillippy K.H., Sherman P.M., Holko M., Yefanov A., Lee H., Zhang N., Robertson C.L., Serova N., Davis S., and Soboleva A. (2013). NCBI GEO: archive for functional genomics data sets–update. Nucleic acids research 41, D991–5. Bonneau R., Reiss D.J., Shannon P., Facciotti M., Hood L., Baliga N.S., and Thorsson V. (2006). The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome biology 7, R36. Bouchet-Valat M. (2014). SnowballC: Snowball stemmers based on the C libstemmer UTF-8 library. R package version 0.5.1. Butts C.T. (2008). network: a package for managing relational data in r. Journal of Statistical Software 24. Butts C.T. (2015). network: Classes for Relational Data. The Statnet Project (http://statnet.org). R package version 1.13.0. Carbon S., Ireland A., Mungall C.J., Shu S., Marshall B., Lewis S., AmiGO Hub t.A., and Web Presence Working Group t.W.P.W. (2009). AmiGO: online access to ontology and annotation data. Bioinformatics (Oxford, England) 25, 288–9. Carlson M., Falcon S., Pages H., and Li N. (2014). Org. sc. sgd. db: Genome wide annotation for yeast. R package version 2. Costanzo M.C., Engel S.R., Wong E.D., Lloyd P., Karra K., Chan E.T., Weng S., Paskov K.M., Roe G.R., Binkley G., Hitz B.C., and Cherry J.M. (2014). Saccharomyces genome database provides new regulation data. Nucleic acids research 42, D717–25. Csardi G., and Nepusz T. (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695. Edgar R., Domrachev M., and Lash A.E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30, 207–210. Emili A., Kobayashi R., and Ingles C.J. (1998). A novel yeast protein influencing the response of RNA polymerase II to transcriptional activators. Proceedings of the National Academy of Sciences of the United States of America 95, 11122–7. Faith J.J., Hayete B., Thaden J.T., Mogno I., Wierzbowski J., Cottarel G., Kasif S., Collins J.J., and Gardner T.S. (2007). Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles. PLoS Biology 5, e8. Feinerer I., and Hornik K. (2015). tm: Text Mining Package. R package version 0.6-2. Feinerer I., Hornik K., and Meyer D. (2008). Text mining infrastructure in r. Journal of Statistical Software 25, 1–54. Fellows I. (2012). wordcloud: Word clouds. R package version 2, 109. Garay E., Campos S.E., Gonz´alez de la Cruz J., Gaspar A.P., Jinich A., and Deluna A. (2014). Highresolution profiling of stationary-phase survival reveals yeast longevity factors and their genetic interactions. PLoS genetics 10, e1004168. Garrido E.O., and Grant C.M. (2002). Role of thioredoxins in the response of Saccharomyces cerevisiae to oxidative stress induced by hydroperoxides. Molecular Microbiology 43, 993–1003. 50
Gautier L., Cope L., Bolstad B.M., and Irizarry R.A. (2004). affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics (Oxford, England) 20, 307–15. Greenfield A., Hafemeister C., and Bonneau R. (2013). Robust data-driven incorporation of prior knowledge into the inference of dynamic regulatory networks. Bioinformatics (Oxford, England) 29, 1060–7. Greenfield A., Madar A., Ostrer H., and Bonneau R. (2010). DREAM4: Combining genetic and dynamic information to identify biological networks and dynamical models. PloS one 5, e13397. Greetham D., Vickerstaff J., Shenton D., Perrone G.G., Dawes I.W., and Grant C.M. (2010). Thioredoxins function as deglutathionylase enzymes in the yeast Saccharomyces cerevisiae. BMC Biochemistry 11, 3. Hardy T., Wu D., and Roach P. (1995). Novel Saccharomyces cerevisiae Gene, MRK1, Encoding a Putative Protein Kinase with Similarity to Mammalian Glycogen Synthase Kinase-3 and Drosophila ZesteWhite3/Shaggy. Biochemical and Biophysical Research Communications 208, 728–734. Henderson R.L. (1995). Job scheduling under the portable batch system. In Workshop on Job Scheduling Strategies for Parallel Processing (Springer), pp. 279–294. Hirata Y., Andoh T., Asahara T., and Kikuchi A. (2003). Yeast glycogen synthase kinase-3 activates Msn2p-dependent transcription of stress responsive genes. Molecular biology of the cell 14, 302–12. Hori A., Yoshida M., Shibata T., and Ling F. (2009). Reactive oxygen species regulate DNA copy number in isolated yeast mitochondria by triggering recombination-mediated replication. Nucleic acids research 37, 749–61. Huynh-Thu V.A., Irrthum A., Wehenkel L., and Geurts P. (2010). Inferring Regulatory Networks from Expression Data Using Tree-Based Methods. PLoS ONE 5, e12776. Johnson W.E., Li C., and Rabinovic A. (2007). Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8, 118–127. Kemmeren P., Sameith K., van de Pasch L.A.L., Benschop J.J., Lenstra T.L., Margaritis T., O’Duibhir E., Apweiler E., van Wageningen S., Ko C.W., van Heesch S., Kashani M.M., Ampatziadis-Michailidis G., Brok M.O., Brabers N.A.C.H., Miles A.J., Bouwmeester D., van Hooff S.R., van Bakel H., Sluiters E., Bakker L.V., Snel B., Lijnzaad P., van Leenen D., Groot Koerkamp M.J.A., and Holstege F.C.P. (2014). Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell 157, 740–52. Kuge S., and Jones N. (1994). YAP1 dependent activation of TRX2 is essential for the response of Saccharomyces cerevisiae to oxidative stress by hydroperoxides. The EMBO journal 13, 655–64. Lee J., Godon C., Lagniel G., Spector D., Garin J., Labarre J., and Toledano M.B. (1999). Yap1 and Skn7 control two specialized oxidative stress response regulons in yeast. The Journal of biological chemistry 274, 16040–6. Leek J.T., Johnson W.E., Parker H.S., Jaffe A.E., and Storey J.D. (2012). The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883. Liao J.C., Boscolo R., Yang Y.L., Tran L.M., Sabatti C., and Roychowdhury V.P. (2003). Network component analysis: reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences of the United States of America 100, 15522–7. Ling F., and Shibata T. (2002). Recombination-dependent mtDNA partitioning: in vivo role of Mhr1p to promote pairing of homologous DNA. The EMBO journal 21, 4730–40.
51
Lisowsky T., and Michaelis G. (1989). Mutations in the genes for mitochondrial RNA polymerase and a second mitochondrial transcription factor of Saccharomyces cerevisiae. Molecular & general genetics : MGG 219, 125–8. MacIsaac K.D., Wang T., Gordon D.B., Gifford D.K., Stormo G.D., and Fraenkel E. (2006). An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC bioinformatics 7, 113. Madar A., Greenfield A., Vanden-Eijnden E., and Bonneau R. (2010). DREAM3: Network inference using dynamic context likelihood of relatedness and the inferelator. PLoS ONE 5. Mannhaupt G., Schnall R., Karpov V., Vetter I., and Feldmann H. (1999). Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. FEBS Letters 450, 27–34. Marbach D., Costello J.C., K¨ uffner R., Vega N.M., Prill R.J., Camacho D.M., Allison K.R., Kellis M., Collins J.J., and Stolovitzky G. (2012). Wisdom of crowds for robust gene network inference. Nature methods 9, 796–804. Miller C., Schwalb B., Maier K., Schulz D., D¨ umcke S., Zacher B., Mayer A., Sydow J., Marcinowski L., D¨olken L., Martin D.E., Tresch A., and Cramer P. (2011). Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast. Molecular systems biology 7, 458. Monteiro P.T., Mendes N.D., Teixeira M.C., D’Orey S., Tenreiro S., Mira N.P., Pais H., Francisco A.P., Carvalho A.M., Louren¸co A.B., S´a-Correia I., Oliveira A.L., and Freitas A.T. (2008). YEASTRACTDISCOVERER: new tools to improve the analysis of transcriptional regulatory associations in Saccharomyces cerevisiae. Nucleic acids research 36, D132–6. M¨osch H.U., and Fink G.R. (1997). Dissection of filamentous growth by transposon mutagenesis in Saccharomyces cerevisiae. Genetics 145, 671–84. Munchel S.E., Shultzaberger R.K., Takizawa N., and Weis K. (2011). Dynamic profiling of mRNA turnover reveals gene-specific and system-wide regulation of mRNA decay. Molecular biology of the cell 22, 2787– 95. Neymotin B., Athanasiadou R., and Gresham D. (2014). Determination of in vivo RNA kinetics using RATE-seq. RNA (New York, N.Y.) 20, 1645–52. Pelechano V., and P´erez-Ort´ın J.E. (2008). The transcriptional inhibitor thiolutin blocks mRNA degradation in yeast. Yeast 25, 85–92. Petralia F., Wang P., Yang J., and Tu Z. (2015). Integrative random forest for gene regulatory network inference. Bioinformatics (Oxford, England) 31, i197–205. Postma L., Lehrach H., and Ralser M. (2009). Surviving in the cold: yeast mutants with extended hibernating lifespan are oxidant sensitive. Aging 1, 957–960. R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Shalem O., Dahan O., Levo M., Martinez M.R., Furman I., Segal E., and Pilpel Y. (2008). Transient transcriptional responses to stress are generated by opposing effects of mRNA production and degradation. Molecular systems biology 4, 223. Shalem O., Groisman B., Choder M., Dahan O., and Pilpel Y. (2011). Transcriptome kinetics is governed by a genome-wide coupling of mRNA production and degradation: a role for RNA Pol II. PLoS genetics 7, e1002273. 52
Shirozu R., Yashiroda H., and Murata S. (2015). Identification of minimum Rpn4-responsive elements in genes related to proteasome functions. FEBS Letters 589, 933–940. Siahpirani A.F., and Roy S. (2016). A prior-based integrative framework for functional transcriptional regulatory network inference. Nucleic acids research pp. e21–e21. Tang G.Q., Deshpande A.P., and Patel S.S. (2011). Transcription factor-dependent DNA bending governs promoter recognition by the mitochondrial RNA polymerase. The Journal of biological chemistry 286, 38805–13. Teixeira M.C., Monteiro P., Jain P., Tenreiro S., Fernandes A.R., Mira N.P., Alenquer M., Freitas A.T., Oliveira A.L., and S´a-Correia I. (2006). The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic acids research 34, D446–D451. Teixeira M.C., Monteiro P.T., Guerreiro J.F., Gon¸calves J.P., Mira N.P., Dos Santos S.C., Cabrito T.R., Palma M., Costa C., Francisco A.P., Madeira S.C., Oliveira A.L., Freitas A.T., and S´a-Correia I. (2014). The YEASTRACT database: An upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Research 42, D161–D166. Traven A., Staresinci´c L., Arneri´c M., and Sopta M. (2002). The yeast protein Xtc1 functions as a direct transcriptional repressor. Nucleic acids research 30, 2358–64. Turcotte B., Liang X.B., Robert F., and Soontorngun N. (2010). Transcriptional regulation of nonfermentable carbon utilization in budding yeast. FEMS yeast research 10, 2–13. Velazquez G., Guo Q., Wang L., Brieba L.G., and Sousa R. (2012). Conservation of promoter melting mechanisms in divergent regions of the single-subunit RNA polymerases. Biochemistry 51, 3901–10. Ward M.P., Gimeno C.J., Fink G.R., and Garrett S. (1995). SOK2 may regulate cyclic AMP-dependent protein kinase-stimulated growth and pseudohyphal development by repressing transcription. Molecular and cellular biology 15, 6854–63. Wemmie J.A., Wu A.L., Harshman K.D., Parker C.S., and Moye-Rowley W.S. (1994). Transcriptional activation mediated by the yeast AP-1 protein is required for normal cadmium tolerance. The Journal of biological chemistry 269, 14690–7. Wu Z., Irizarry R.A., Gentleman R., Martinez-Murillo F., and Spencer F. (2004). A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association 99, 909–917. Xu F., Zhang K., and Grunstein M. (2005). Acetylation in Histone H3 Globular Domain Regulates Gene Expression in Yeast. Cell 121, 375–385. Yu D., Danku J.M.C., Baxter I., Kim S., Vatamaniuk O.K., Vitek O., Ouzzani M., and Salt D.E. (2012). High-resolution genome-wide scan of genes, gene-networks and cellular systems impacting the yeast ionome. BMC genomics 13, 623.
53