Universal Skewed Distribution Associated with Stochastic Processes of Gene Expression in a Eukaryotic Cell
Vladimir A. Kuznetsov
Laboratory of Integrative and Medical Biophysics, NICHD, NIH, Bethesda, MD, 20892, USA
ABSTRACT
Diverse large-scale gene-expression data sets have been analyzed in order to identify the frequency distribution of gene expression levels (or transcript copy numbers) in eukaryotic cells. Determining such function(s) may provide a theoretical basis for accurately counting all expressed genes in a given cell and for understanding gene expression control. We found that observed gene expression level data in yeast, mouse and human appear to follow a Pareto-like discrete skewed frequency distribution, which is also observed for many other complex systems. We also describe an empirical frequency distribution of 3'-end NlaIII sites in mRNA sequences of the yeast genome, which allowed us to develop a novel method for estimating and removing major experimental errors and redundancies from SAGE data sets. We produced a stochastic model which predicts the skewed probability function that accounts for many rarely transcribed genes in a single cell. A sporadic and random transcription mechanism for all protein-coding genes in every cell is predicted. This fundamental mechanism might provide a basic level of phenotypic diversity and adaptability in eukaryotic cell populations.
Keywords and phrases: gene expression levels, stochastic processes, experimental errors, binomial differential distribution, a single cell
1 Introduction
Cells must adjust genome expression to accommodate changes in their environment and in outside signals. Gene expression within a cell is a complex process involving chromatin remodeling, selective transcription of DNA into mRNA, and mRNA export from the nucleus to the cytoplasm, where it is translated into proteins. The expression level of any protein-coding gene is generally measured by the number of associated mRNA transcripts (messenger RNA abundance) present in a sample from many thousands of cells. While the mRNA abundance in a cell at a given moment does not guarantee a precise prediction of the amounts of subsequently produced protein, mRNAs sampled from a same-type cell population nevertheless serve as important indicators that certain proteins are being produced.
The complete gene expression profile for a given cell is the list of all expressed genes, together with each gene's expression level defined as the number of cytoplasmic mRNA transcripts in the cell (Bishop et al, 1974; Strausberg et al, 2000; Emmert-Buck et al, 2000). However, gene-expression profiling technologies (e.g., serial analysis of gene expression (SAGE) (Velculescu et al, 1995; Velculescu et al, 1997; Zhang et al, 1997) and GeneChips (Lockhart et al, 1996; Holstege et al, 1998; Jelinsky et al, 1999; Wodicka et al, 1997)) currently can only measure the gene expression levels for a fraction of all expressed genes, based on sampling transcripts found in millions of cells (i.e., not a single cell). These methodologies determine certain short tags on a transcript; one can then count the number of transcripts carrying the same tag. Many genes, in particular those expressed at low levels or those encoding small mRNAs, cannot be unambiguously detected due to the limited sampling of transcripts and experimental errors. However, many of these transcripts may be essential for gene expression control and for determining normal and pathological cell phenotypes.
Statistically, gene expression behavior can be characterized by the gene expression level probability function (GELPF). The GELPF is a function that, for each possible gene expression level value, takes on the probability of that value occurring for a given gene. For a cell or cell population, this function specifies the proportions of expressed genes which have 1, 2, etc. transcripts present. Given histograms of gene expression level values, we can model the underlying "population" probability functions. Such statistical knowledge bears upon the fundamental biological problems of cell regulation, adaptation and development. General features of gene expression patterns were elucidated more than 25 years ago through RNA-DNA hybridization measurements (Bishop et al, 1974). Gene expression data have skewed, long-tailed frequency distributions (Bishop et al., 1974; Kuznetsov, 2001), which are also often observed in physiological processes (Ramsdem and Vohradsky, 1998) and in DNA-related phenomena (Borodovsky and Gusein-Zade, 1989; Mantegna et al, 1994; Li et al, 1999), as well as in many self-organizing systems with strong stochastic components (Adam, 1998; Stanley et al, 1999). However, mathematical models of the underlying distribution of gene expression levels in eukaryotic cells have not been previously identified, due to undersampling and unreliable detection of many low-abundance genes, as well as sequencing errors and complications of tag-gene matching. The goal of this study is to develop and apply such models for characterization of the global gene expression profiles in a single eukaryotic cell and in a population of cells.
2 Methods, Data Bases, and Software
2.1 Data Bases
There are several useful methodologies that allow global quantitative measurement of RNAs captured from cells of interest. All these techniques make and then use DNA sequences complementary to the less stable mRNA molecules. The cDNA library method counts the number of sequences having matching or overlapping sequences. Such cDNA expressed sequence tags (ESTs), consisting of similar or overlapping ~500-nucleotide cDNA sequences and called UniGene clusters (Strausberg et al, 2000; Emmert-Buck et al., 2000), are used to group observed sequences into clusters representing presumed genes or ESTs based on sequence homology. The UniGene clusters can be used to "tag" genes expressed in specific cell types. The occurrence frequency of each UniGene cluster in a cDNA library might serve as an estimator of the gene expression level in the cell population from which the cDNA library was constructed.
The SAGE methodology is based on isolating distinct 10-nucleotide DNA sequences, called SAGE tags, from the 3'-end regions of individual transcripts and concatenating the tags serially into long DNA molecules (Velculescu et al, 1997; Zhang et al, 1997). Cloning and sequencing of such molecules allows the identification and enumeration of cellular mRNA transcripts. Since the genome organization in yeast (Saccharomyces cerevisiae) is relatively simple, and since almost all yeast genes are known and are mapped on chromosomes, we analyzed a large yeast SAGE database (www.sagenet.org, http:genome-www.stanford.edu/Saccharomyces) (Velculescu et al, 1997). We analyzed three SAGE libraries for yeast cells in log phase, S-phase-arrested, and G2/M-phase-arrested states, separately and pooled. SAGE and cDNA libraries for various human cell lines and cell tissues were downloaded from the UniGene and CGAP (www.ncbi.nlm.nih.gov/CGAP/ncicgap; www.ncbi.nlm.gov/SAGE) databases. Some of these libraries are characterized in Table 1. DNA chip technology can measure the expression of thousands of genes simultaneously (Lockhart et al, 1996; Wodicka et al, 1997; Holstege et al, 1998) using a ~1 cm² square onto which a cDNA mixture derived from a cell population can be hybridized to form labeled complementary spots. For example, in Affymetrix GeneChips, each gene is represented on the high-density oligonucleotide arrays by ~20 unique 25-mer oligonucleotide probes that match the sequence of the gene (perfect match oligos) and 20 oligonucleotide probes that are identical except for a one-base difference (mismatch oligos). The gene expression level in a cell sample is estimated by the mean of the differences in the hybridization signals of matched and mismatched probes after the labeled message is hybridized with the probes. The mean difference value over 20 paired signals is a measure of the expression level of that gene. Based on this estimate, the computed score of the gene expression level can sometimes be negative; therefore, the database could be additionally scaled to positive values. In this report, data sets of oligonucleotide arrays containing probes for ~6,200 yeast open reading frames (ORFs) and genes (GeneChip® Ye6100 arrays, Affymetrix, Santa Clara, CA) (Lockhart et al, 1996; Wodicka et al, 1997) have been scaled and analyzed.
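The mean-difference estimate described above can be illustrated with a minimal Python sketch. The function name and the signal values are hypothetical, and only three probe pairs are used here instead of the ~20 pairs per gene on the arrays. The rescaling to positive values is not specified in the text and is therefore not shown.

```python
from statistics import mean


def average_difference(pm_signals, mm_signals):
    """Mean of (perfect match - mismatch) over the probe pairs of one gene."""
    if len(pm_signals) != len(mm_signals) or not pm_signals:
        raise ValueError("need equal-length, non-empty PM and MM signal lists")
    return mean(pm - mm for pm, mm in zip(pm_signals, mm_signals))


# Hypothetical hybridization signals for three probe pairs of one gene.
pm = [120.0, 100.0, 90.0]
mm = [100.0, 110.0, 60.0]
score = average_difference(pm, mm)

# A gene whose mismatch probes are brighter gets a negative score,
# which is why the text mentions additional scaling.
negative_score = average_difference([10.0], [30.0])
```

The second call shows how a negative expression score can arise from this estimator.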
2.2 Random Sampling to Construct Sub-Libraries
Assuming each identified transcript is selected at random, we used Monte-Carlo sub-sampling of transcripts in a library without replacement to generate sub-libraries. Sub-libraries are needed in order to generate same-size libraries for comparison and to construct the growth curve for distinct true tags or genes of a given library.
2.3 Goodness of Fit Analysis Method and Software
Let the data points $(m, g(m))$ for values $m = 1, \ldots, J$ form an empirical relative frequency distribution $g$. Note that $\sum_{m=1}^{J} g(m) = 1$. We adjusted the vector of parameters $\bar{a} = (a_1, a_2, \ldots, a_v)$ in the model probability function $f(m; \bar{a})$ to fit the histogram points $(m, g(m))$ by maximizing the similarity between $f$ and $g$ using the modified Akaike's Information Criteria (Lujng, 1998), which we will call the Model Selection Criteria (MSC):
$$\Psi = \log\left( \frac{\sum_{m=1}^{J} (g(m) - E(g))^2}{\sum_{m=1}^{J} (g(m) - f(m; \bar{a}))^2} \right) - \frac{2\nu}{J},$$
where $J$ is the maximum observed value of $m$, $\nu$ is the number of unknown parameters of the model $f$, and $E(\cdot)$ is the mean value of the observed data. The most appropriate model will be that with the largest $\Psi$. $\Psi$ is independent of the scaling of the data points. The $\Psi$-criterion ranges between excellent (10-8], very good (8-6], and satisfactory (6-4). Parameters in both differential and algebraic models were estimated using the Marquardt-Levenberg iterative curve-fitting algorithm in the MLAB mathematical modeling software (Civilized Software, Inc., www.civilized.com), coupled with Monte-Carlo refinements of random sampling fluctuations. We also used additional standard goodness-of-fit criteria in MLAB (sum of squares of deviations, residual analysis, the Wilcoxon 2-sample rank-order test, etc.). Symbolic differentiation was performed using MLAB. Monte-Carlo experiments and numerical analysis were also performed in MS Digital Visual Fortran. Data-mining tools of the Cancer Genome Anatomy Project, including Xprofiling and SAGE/map (Lal, et al. 1999; Lash et al, 2000, www.ncbi.nlm.nih.gov/SAGE), have also been used.
3 Skewed Distributions of Gene Expression Levels in Eukaryotic Cells
3.1 Empirical Histograms and Pareto-like Statistics
Let us define an expression library as a set of sequenced cDNA tags that match genes. These tags are derived from a sample of mRNAs isolated from a population of cells. The size of a library, $M$, is the total number of tags observed in the library. Let $n(m, M)$ denote the number of distinct tags which have expression level $m$ (tags) in the library of size $M$. Let $J$ denote the maximum observed expression level of tags in the library. Let $N = \sum_{m=1}^{J} n(m, M)$; $N$ is the number of distinct tags in the library. The points $(m, n(m, M)/N)$ for $m = 1, \ldots, J$ form the histogram corresponding to the empirical relative frequency distribution $g(m)$. Note that $\sum_{m=1}^{J} n(m, M)/N = 1$.
Note that, due to experimental errors, the observed values of $m$ and $n$ might only approximately reflect the transcript number (or gene expression level) for a given gene and the number of genes represented by $m$ transcripts, respectively. The observed values $M$ and $N$ also only approximately reflect the total number of mRNA transcripts and the number of different transcripts in a library (see Section 4).
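The library statistics defined in Sections 2.2, 2.3 and 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MLAB/Fortran code used in the study: the function names and the toy library are hypothetical, the logarithm in the MSC is assumed to be natural, and the sums run over every level m = 1, ..., J, including levels with no tags.

```python
import math
import random
from collections import Counter


def subsample_library(tag_counts, size, seed=0):
    """Draw `size` tags without replacement from a library {tag: count}."""
    pool = [tag for tag, count in tag_counts.items() for _ in range(count)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return Counter(rng.sample(pool, size))


def empirical_histogram(tag_counts):
    """Return (M, N, J, g), where g[m - 1] = n(m, M) / N for m = 1..J."""
    levels = list(tag_counts.values())
    M, N, J = sum(levels), len(levels), max(levels)
    n = Counter(levels)  # n[m] = number of distinct tags seen m times
    g = [n.get(m, 0) / N for m in range(1, J + 1)]
    return M, N, J, g


def msc(g, f, n_params):
    """Psi = log(sum (g - E(g))^2 / sum (g - f)^2) - 2 * nu / J.

    Natural log is an assumption; a perfect fit (zero residual) would
    raise ZeroDivisionError and is not handled in this sketch.
    """
    J = len(g)
    mean_g = sum(g) / J
    total = sum((gm - mean_g) ** 2 for gm in g)
    resid = sum((gm - fm) ** 2 for gm, fm in zip(g, f))
    return math.log(total / resid) - 2 * n_params / J


# Hypothetical toy library: four distinct tags, nine tags in total.
library = {"AAAA": 1, "CCCC": 1, "GGGG": 2, "TTTT": 5}
M, N, J, g = empirical_histogram(library)
sub = subsample_library(library, size=4)
psi = msc([0.5, 0.3, 0.2], [0.5, 0.25, 0.25], n_params=1)
```

For the toy library, the histogram has gaps at m = 3 and m = 4, which mirrors the gaps among higher expression levels seen in real libraries.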
The histogram of the proportions of distinct tags (the 10 bp tags of SAGE libraries or the expressed sequence tags (ESTs) of cDNA libraries) represented by one, two, etc. tags is the empirical relative frequency distribution of tags, which reflects the gene expression levels in a given cell sample. This is a size-frequency form of the probability distribution, which represents an estimate of the GELPF for the corresponding cell sample (Figure 1). We found that such histograms, constructed for all yeast, mouse and human gene-expression libraries, exhibited remarkably similar, monotonically skewed shapes (Figure 1, Figure 2), with a greater abundance of rarer transcripts and more gaps among the higher-occurrence level values. Analysis of more than 30 different SAGE and cDNA libraries (Lal, et al. 1999; Zhang et al, 1997; Velculescu et al, 1999; Emmert-Buck, et al, 2000; Lash et al, 2000), made from various specific human bulk tissues, from histologically homogeneous cells, and from cell lines, confirmed our observations and goodness-of-fit results. These are partially presented in Table 1.
Several classes of skewed probability functions (Poisson, exponential, logarithmic series, power law Pareto-like (Johnson et al, 1992)) were fit to empirical gene expression level histograms for various libraries. The best fit (by our criteria) was obtained using the discrete Pareto-like probability function:
$$f(m) := \Pr(X = m) = \frac{1}{z} \frac{1}{(m + b)^{k+1}}, \qquad (1)$$
where the random variable $X$ is the expression level for a randomly chosen distinct tag (representing a gene). The function value $f(m)$ is the probability that a randomly chosen distinct gene is represented by $m$ tags (representing an expression level). The argument $m$ denotes a possible value of $X$. The function $f$ involves two unknown parameters, $k$ and $b$, where $k > 0$ and $b > -1$; $z$ is the generalized Riemann Zeta-function value: $z = \sum_{j=1}^{J} \frac{1}{(j + b)^{k+1}}$.
Note that our model involves the sample-dependent quantity $J = J(M)$. We call Eq.(1) the Generalized Discrete Pareto (GDP) model. The parameter $k$ characterizes the skewness of the probability function; the parameter $b$ characterizes the deviation of the GDP distribution from a simple power law.
The GDP model with $b \neq 0$ provides the best fit to almost all empirical histograms we studied. In the log-log plot form (Zipf's plot), the empirical distributions for larger human SAGE libraries and all cDNA libraries show systematic deviations from a straight line (see, for example, Fig. 2A). In SAGE libraries with a library size of less than ~40,000 tags, the GDP model at $b = 0$ fits well (see Figure 1A and Figure 2A). By our criteria, the GDP model has priority in comparison to more complex models, for example, a mixed logarithmic series-exponential distribution. For example, for library sizes greater than 40,000, the values of the MSC criterion for the latter model were regularly ~20-40% less than for the GDP model (data not presented). Note that, given the number of distinct tags, $N$, and the best-fit parameters $k$ and $b$, we can draw $N$ random values of $X$ from the probability function Eq.(1) and then generate a histogram which will simulate an empirical frequency distribution for the given value $N$ with a random number of tags, $M = M(N; k, b)$, in the sample. This Monte-Carlo procedure was used in order to estimate the variability of any expression levels associated with $N$ distinct tags in a given library and to estimate the maximum gene expression level. Using this procedure many times with the GDP model, we computed the largest expression level $J$ and the scale factor $s$ such that $s = J/M$ for each Monte-Carlo experiment. We then averaged these scale factors to obtain their mean $\hat{s}$. We did that for our SAGE libraries and found that values of $\hat{s}$ ranged from 0.012 to 0.045. Similar ranges were observed in the empirical ratios $J/M$ (Table 1).
3.2 Effect of library size on empirical gene-expression level distributions
Similarly-sized libraries derived from various human tissues or from cancer cell lines have similar numbers of expressed genes (see figure legend, Figure 2A, Table 1). They are also characterized by similar empirical relative frequency distributions of gene expression levels, with nearly equivalent parameters in their best-fit probability function models (see, for example, the prostate cancer cell line (library 2892a) and the sub-sampled normal brain cell tissue library 154 in Figure 1B). Although the yeast genome is less complex, yeast libraries show similar relationships (see Table 1). However, as library size increases, the fraction of low-abundance distinct tags becomes smaller (see parameter $p_1$, the fraction of distinct tags represented by one copy, in Table 1), and the shape of the probability distribution function changes systematically ($b$ becomes bigger, see Table 1; Figure 1C). We also found that the value of the maximum observed gene expression level in the sample, $J$, was strongly linearly correlated with the library size $M$ (Table 1). Thus, we might assume that all human and mouse cell types and yeast cells have a common GELPF. However, a single fixed GDP model (where parameters are constants) cannot describe all empirical frequency distributions independent of library size, since the probability function changes as the number of transcripts in a library becomes larger.
Interestingly, in self-similar (fractal) systems, described by a power law or Pareto-like distributions, the parameter(s) are independent of the size of the system (Stanley et al, 1999), but not in our case. Moreover, such models, including the GDP model, predict an unlimited increase in the number of species as the sample size approaches infinity, whereas the number of expressed genes is a finite number. The problems of library size dependence of the GDP model parameters and the incorrect infinite limit for the number of genes as $M \to \infty$ are both solved by introducing a new statistical distribution model, which we will call the Binomial Differential distribution (see Appendix). In this model we assume that (1) the number of expressed genes in a cell or in a cell population is a finite number, $N_t$; (2) each gene in a given cell population is expressed with a certain probability; and (3) a transcription event for a given gene is statistically independent of such events for other genes. Although transcription events of some genes may in fact be correlated in a given cell, most transcription events in a cell population seem to be random, independent events. This is consistent with observations in (Chelly et al., 1989; Ko, 1992; Ross et al, 1994; Walters et al, 1995; Newlands et al, 1998; Hume, 2000). We further assume that tags in a SAGE library or a cDNA library are chosen at random. Our assumptions are consistent with constructing such libraries by sampling from a hypergeometric distribution (Johnson et al, 1992). We also take into account that a typical SAGE (and cDNA) library size (~$10^3$ to $10^5$ tags) is much smaller than the number of transcripts in a typical cell sample ($> 10^{11}$ transcripts in $> 10^6$ cells). That allows us to use the multinomial approximation (Johnson et al, 1992) of the hypergeometric distribution and leads to the skewed distribution model. This new model also explains the GELPF invariance for many cell types (see Discussion and Appendix).
4 Analysis of Experimental Errors of SAGE Method
All methods of gene-expression profiling use specific sequence "tags" for gene identification. However, the tag sequences match only a short sequence of a gene locus. The 10 bp segment immediately downstream of the 3' NlaIII site, CATG (on the coding strand), corresponds to the expressed SAGE tag. The SAGE tags are significantly shorter than the ESTs of cDNA libraries, and there are significantly more SAGE tags available for analysis. This suggests a greater sensitivity for detecting low-abundance transcripts.
To identify the correct distribution of gene expression levels in cells by fitting the empirical gene expression level histograms, we must first eliminate the experimental errors in SAGE libraries so that the corresponding histograms will be unbiased. There are at least four technical problems of SAGE experiments that hinder the best estimates of the GELPF. The problems could be defined as (Zhang et al, 1997; Velculescu et al, 1999; Lal, et al. 1999; Lash et al, 2000; Chen et al, 2000; Jones et al, 2001): (1) sampling errors in the tag accumulation process; (2) non-randomness of DNA sequences; (3) intrinsic experimental errors including sequencing errors and ambiguities of tag-to-gene matching. The first problem is important for comparison of the abundance of individual genes in cDNA or SAGE libraries, and this problem is already discussed above. We have found that as library size increases, the proportion of rare transcripts becomes smaller (Figure 1C, Figure 2) and the shape of the density function changes systematically (parameter $b$ becomes more positive; Table 1).
Sampling of the transcripts demonstrates the randomness of typical SAGE library or cDNA library construction. In particular, the same SAGE library taken at two different stages of the sequencing process exhibited non-uniform changes in the expression levels of different genes (Figure 2A). However, the differences between the frequency distributions of the transcript numbers at the sequential stages of library construction disappeared after Monte-Carlo sub-sampling (see Methods) of the larger data set (library 2892b) to obtain a sample of the smaller library size (library 2892a) (Figure 2A). A similar property was observed in yeast SAGE libraries and in the different human cDNA libraries (data not presented).
The inherent experimental errors and the ambiguities of tag-to-genes and tags-to-gene matching are more difficult, and both need more serious consideration. What fraction of the transcripts which can be isolated has a potentially biologically meaningful function? Here we will answer this important question for the yeast transcriptome. Since almost all yeast protein-coding genes and open reading frames (ORFs; an ORF is a DNA sequence which is (potentially) translatable into protein, i.e. likely to be a gene) are known and are located on chromosomes, we can obtain the "true" distinct tags and their expression levels in a yeast SAGE library by eliminating the erroneous tags that fail to match known genes/ORFs in the Tag Location database for the yeast transcriptome [www.sagenet.org, http:genome-www.stanford.edu/Saccharomyces]. This database was generated by Velculescu et al. (1997) and currently contains information about ~8,500 distinct SAGE tags, which match ~4,700 genes/ORFs (of ~6,200 known genes/ORFs in the yeast genome), together with the chromosome coordinate of each SAGE tag, the strand, the associated gene name(s) (where relevant), and the chromosome location of genes/ORFs.
In our analysis, the 10 bp sequences immediately downstream of the 3' NlaIII site found within a gene or defined ORF, or within the 500 bp downstream adjacent genomic region with a 3' NlaIII site, have been taken as "true SAGE tags". Thus, the "true" tags are those tags that match ORFs/genes or the adjacent region, but do not match any non-coding regions or the opposite (non-translated) strand.
Thus, in the Tag Location database we can consider two major categories of erroneous SAGE tags: (1) the "outside" erroneous tag, a SAGE tag arising from construction and sequencing errors (the "non-genome" erroneous tags), and (2) the "inside" erroneous tag, which matches non-coding DNA regions (the "genome-associated" tag matching, for example, tags in the antisense orientation (Jones et al., 2001), including untranslatable RNA molecules (Altuvia and Wagner, 2001)), or "suspected" tags which can match both coding and non-coding regions. Additionally, no tags exist for some genes/ORFs in the current database (for example, genes without NlaIII sites). Note that a single true tag can match more than one gene/ORF or adjacent region, and several true tags can match a single gene/ORF. Figure 3 illustrates these multi-matching problems.
Table 2 shows the number of tags, the number of distinct tags, and the number of yeast genes/ORFs present in the different categories of tags of the Tag Location database. This table shows that ~20% of 59,494 tags are erroneous tags. Antisense tags, presented in the last column of Table 2, constitute a significant fraction (~50%) of the 8,280 "inside" erroneous tags, corresponding to 77% of the 1,961 gene/ORF sequences matched by all "inside" erroneous tags. It was assumed that most of these antisense SAGE tags are caused by mispriming of the oligo(dT) to internal poly(A) stretches after first-strand synthesis during SAGE library construction (Jones et al., 2001). However, a minor fraction of highly abundant transcripts in C. elegans, in which both the antisense and sense SAGE tags were found, might not be an artifact caused by mispriming (Jones et al, 2001). We also found highly abundant antisense transcripts and, additionally, a significant correlation between such antisense and sense tags in the SAGE yeast database (Kuznetsov, unpublished data). The last row in Table 2 and Figure 4 show that a significant fraction of genes/ORFs could be multiply matched by the different categories of tags. Figure 4A shows population frequency distributions of "true" tags, "inside" erroneous tags and antisense tags within the corresponding genome regions. Such a frequency distribution reflects the probability that a SAGE tag occurs within a given genome region 1, 2, ... times. For the entire genome, this is the proportion of 3' NlaIII restriction sites located within a given subset of genome regions, for example, within protein-coding regions. Figure 4A shows that 63.4% of 4,244 genes/ORFs are represented by true tags only once, 26.3% two times, 7.8% three times, 2% four times, 0.4% five times and 0.07% six times. The frequency distribution of "outside" erroneous tags differs significantly: this distribution shows tags that match the same (non-coding) genome regions many times. However, the fraction of such tags is relatively small and is mostly associated with tags matching genome regions outside of known genes/ORFs. In all our cases, the distributions of SAGE tags in coding and non-coding regions do not follow a power law or the GDP distribution. This result is consistent with observations for other DNA-related phenomena (Borodovsky and Gusein-Zade, 1989; Li, 1999). Interestingly, the GDP model fits well the frequency distributions of the expression levels of the different classes of erroneous tags, as well as the frequency distribution of true tags. For example, the frequency distribution of 5,819 true distinct tags for the pooled library was fitted using the GDP model with $k = 0.967 \pm 0.0001$, $b = 0.494 \pm 0.0003$, and $\Psi = 7.7$.
Figure 4B shows the histograms of the proportions of distinct tags represented by one, two, etc. genome sites for populations of "true" tags, "inside" erroneous tags, and antisense tags, respectively. Figure 4B shows that the distributions of genome sites matched by tags are significantly different for coding regions and non-coding regions. Figure 4B shows that 5,357 (92%) of 5,819 "true" distinct tags are represented by one genome site within the ORF or downstream adjacent region sequences associated with 3,813 of 4,244 ORFs. However, 411 (7%) of 5,819 "true" distinct tags are represented by two genome sites associated with 742 of 4,244 ORFs. If singleton tags are ignored, the latter result indicates that a significant fraction of "true" distinct tags match more than one gene/ORF.
Figure 5A shows the empirical histogram of the 5,303 distinct tags represented by 19,527 tags in the yeast library derived from G2/M phase-arrested cells (◦), and of the 3,200 "true" distinct tags (•) of the same library after the elimination of 2,103 distinct tags associated with 3,239 tags that did not match ORFs or their adjacent genomic regions. Most of these erroneous tags occur with only 1 or 2 copies (Table 3, Figure 5B). These erroneous tags comprise 16.6% of the 19,527 tags in the library and might be considered as a sum of sequencing-error tags and false-positive tags matching the non-coding regions. The last source of errors makes up ~10% of the library size. Figure 5 and Table 3 show that the GDP model at $b = 0$ (simple power law) fits well the frequency distributions of the different classes of erroneous tags, but $b > 0$ in the case of the frequency distribution of true tags. Matches of many distinct tags to the same gene, and of one distinct tag to many genes, constitute serious and common problems in correctly identifying genes and properly determining their expression levels (Velculescu et al, 1999; Lal et al., 1999; Lash et al, 2000; Chen et al., 2000), particularly in larger SAGE libraries. Such matching confusions are associated with using short (10-nucleotide) tags and with the existence of multiple restriction sites at the 3' end of sequences (Velculescu et al, 1999; Lash et al, 2000). Thus, we have tags that redundantly match the same genes/ORFs as do other tags, as well as tags that match several different genes/ORFs. These problems are, obviously, more acute in the case of higher organisms due to the higher complexity of their genomes. Figure 6A shows that the difference between the growth curves for "true" distinct tags and for ORFs matched by "true" tags rapidly increases for $M > 10{,}000$. This difference reflects a rapid increase in the mean number of distinct "true" tags per gene as library size increases.
et al, 1997), which also was based on tabulating the distinct ORFs found in the yeast Tag Location database.
Importantly, tags that matched only non-coding DNA regions and \redundant" tags apparently have not been correctly discarded in any recent predictions of the number of expressed genes in cell types. Therefore basing such estimates on uncorrected bigger human SAGE libraries (100,000-600,000 tags) must lead to a signi¯cant over-estimation of the number of expressed genes in human cell types (see, for example, Velculescu et al, 1999).
4.1
4.2
Estimating the GELPF for a Single Yeast Cell
First, we used the BD-model (Eq.(18); Appendix) with the ¯tted parameters c = 0:579 § 0:010 and d = 6; 580 § 190 in p1 (M ) to compute values p1 ; :::; p6 for 3,009 ORFs corresponding to the library size Mcell = 15; 000. Then we ¯t the GDP model (Eq.(1)) to these 6 points and extrapolated the ¯tted GDP model to estimate values of pm for m > 6. This use of the GDP model was necessary because numerical algorithms cannot accurately and reliably compute values of high-order derivatives. However, when we ¯t the GDP model and the BD model to the same empirical histograms for \true" distinct tags, we observed that the GDP model is a good approximation of the BD model (data not shown). Moreover, both the BD model and the GDP model are power law forms with similar values of shape parameters on the left tail of the probability distribution (see Eq.(1), Table 3, and Eq.(11). These observations justify using the GDP model to estimate pm for larger m. To check the self-consistency of our predictions, we, additionally, estimated the total number of transcripts, M , from the ¯tted GDP model and noted that the result was 15,000.
Estimating the Number of Genes in Yeast Cells
We found that the LG model (Eq.(13) and Eq.(15); Appendix) ¯t both the size-dependent data for \true tags" and for ORFs/genes (Figure 6A). In the case of "true" distinct tags (+, Figure 6A) (but where tags-to-gene and tag-to-genes multiple matches were not considered), the LG model predicts a very large value, 25,103§2; 000 genes (by Eq.(17); Appendix) with d=20,000§1; 946; c=0.356§0:02) in the large yeast cell population. When we tabulated the distinct ORFs that correspond to these "true" distinct tags at various sample sizes (±, Figure 6A) and ¯t this M vs. N data, the ¯tted LG model predicts 7,025§200 genes/ORFs with c=0.579§0:010 and d=6,580§190) in a yeast cell population. This estimate is s4-10% higher than current estimates of the total number of distinct ORFs in the yeast genome (6,200-6,760 genes/ORFs) (Johnson, 2000; Cantor and Smith, 1999). This di®erence could be due to the small number of erroneous tags and redundant tags which nevertheless match genes/ORFs and their adjacent genomic regions. Our analysis does not take in to account missed ORFs within the yeast genome (in particular, shorter ORFs), and overlapped ORFs. Additionally, about 13% of transcripts would be expected to lack an NlaIII anchoring enzyme site and would therefore be missing in the database. Using an estimate of the number of mRNAs per yeast cell (Mcell = 15; 000 (Velculescu et al, 1997), Eq.(17) predicts 3,009 ORFs per cell. This estimate is consistent with the number of genes/ORFs for a single yeast cell in the G2/M phase-arrested state (2,936 ORFs matched by "true" distinct tags in this library) and with a published estimate of ORFs for a single yeast cell in the log-phase of cell growth (Velculescu
Figure 6B shows the predicted GELPF at all possible levels of gene expression for a single yeast cell. The step function (solid line) represents the relative frequencies estimated by the BD model for low-abundance genes, which make up 85% of the ~3,000 genes/ORFs in a yeast cell. For higher-abundance genes, the GELPF was estimated using the GDP model. The theoretical histogram (●) in Figure 6B was generated in 3,009 Monte Carlo experiments by sampling from the fitted GDP distribution and counting the numbers of genes/ORFs found at the same expression value. Figure 6B shows that 38% of the ~3,000 expressed genes are represented by a single mRNA copy per cell. Moreover, Figure 6A shows that a given single cell (at M = 15,000 transcripts per cell) expresses only 45% of all protein-coding genes; the other 55% of all protein-coding genes are expressed at very low levels (< 1 copy per cell). We used data obtained by GeneChip technology (Jelinsky et al, 2000) to construct the empirical histogram of the gene expression levels in untreated log-phase yeast cells (Figure 6B). This histogram was constructed as follows: for each ORF/gene, we converted the scaled hybridization intensity signal value, I, in the yeast GeneChip database [9] to the number of mRNA molecules per single yeast cell by the empirical formula m = (I - 20)/165. The conversion shows close agreement with the estimates of transcript numbers per cell for 16 different yeast genes (Holstege et al, 1998) observed in three different yeast GeneChip databases (Holstege et al, 1998; Jelinsky et al, 1999; Jelinsky et al, 2000). Then, summing the m-values in the unit intervals centered at 1, 2, ..., 143 (the estimated maximum gene expression level in the log-phase yeast cell for this library) produces the gene expression levels for a single yeast cell characterized by the GeneChip.

We then obtained a gene expression level histogram (Figure 6B). The entire range of expression levels contained ~3,000 expressed genes/ORFs representing ~16,000 transcripts per cell. Figure 6B shows that the frequency distribution for the GeneChip data also follows the GDP model (k = 0.86 ± 0.001, b = 0.37 ± 0.003 at MSC = 7.4). Similarly skewed frequency distributions were also observed in other (untreated) yeast cell GeneChip libraries (Holstege et al, 1998; Jelinsky et al, 1999). Thus, the distribution predicted by our analysis of SAGE data and our estimated frequency distribution based on GeneChip data are close to each other (see Figure 6B). A larger fraction of unique transcripts in the case of GeneChip data (~45% vs. 38% in our SAGE data distribution) is expected because the GeneChip method is more sensitive in detecting, at least, low-abundance genes (Holstege et al, 1998). The relatively small systematic differences between the tails of the two distributions might arise because the hybridization intensity score does not correlate strongly and linearly with the target molecule concentration for highly abundant transcripts (Lockhart et al, 1996; Jelinsky et al, 1999). Observed deviations between our two gene expression level distributions could also be related to differences in experimental conditions, experimental normalization procedures, and cell types. However, both experimental techniques provide skewed Pareto-like distributions for all reliably observed transcripts.

5 Discussion

The very skewed form of the empirical histograms of gene expression levels that we studied is often observed in a variety of systems that have many different, and mostly rare, species and demonstrate stochastic behavior. Such distributions have been used in many areas such as linguistics, astrophysics, internet networking, business, and ecology (e.g., the word frequency in a text, or the abundance of species in a population) (http://linkage.rockefeller.edu/wli/zipf). In particular, simple power law distribution models have been used for various "DNA-related" phenomena presented in the "linguistic" or rank-frequency form (Mantegna et al, 1994; Li, 1999), for example, coding and non-coding sequences (Mantegna et al, 1994; Stanley et al, 1999) and ORF lengths in genomes (Borodovsky and Gusein-Gade, 1989; Li, 1999). However, systematic deviations of these models from empirical data have been reported (Borodovsky and Gusein-Gade, 1989; Li, 1999). We have observed that SAGE tags matching coding and non-coding DNA regions (including antisense tags) are distributed differently on the yeast genome sequence, and that neither group follows Pareto-Zipf-like laws (Figure 4).

This paper has demonstrated that the empirical histograms of gene expression levels for yeast cells in various cell cycle stages, and for all analyzed human and mouse cell types, are well described by a "generalized" power law, called the Binomial Differential (BD) distribution. For a given sample size, this skewed distribution is approximated by the GDP model. We also found that the empirical histograms of gene expression levels change in the same way for many cell types or cell states as the number of transcripts in a library changes (Figure 2A, Table 1). The skewed form and quantitative similarity of the empirical histograms of gene expression levels for any two same-size libraries, regardless of human cell type, suggest a common underlying GELPF, perhaps due to the action of a common stochastic mechanism for gene expression. This conclusion also applies to the BD model, which assumes that protein-coding genes in a cell are expressed sporadically and independently. Modeling SAGE experiments in yeast has allowed us to develop a method to estimate the cumulative numbers of expressed protein-coding genes and of erroneous and redundant sequences. After eliminating the erroneous tags and redundant tags, we estimated that ~55% of all yeast protein-coding genes are expressed at very low levels ( [...]

[...]ing their exons) in a large same-type cell population in global transcriptional response of cells due to internal random perturbations a cell. In normal yeast libraries (Jelinsky et al, 2000), we observed that only ~250 of the ~6,200 yeast ORFs/genes are not detected. About 100 of these 250 ORFs are classified as questionable ORFs and, additionally, more than 50 others of the 250 ORFs are classified as hypothetical protein ORFs. Treatment with 6 different damaging factors (Jelinsky et al, 2000) shows that only ~100 yeast ORFs were still not observed using GeneChip technology. However, most genes/ORFs are still represented by a very small number of transcripts.

[...] (6)

Now, using Eqs.(3)-(6) we can derive the recursion formula:

$$\frac{n(m, M)}{M} - \frac{n(m, M+1)}{M+1} = \frac{n(m+1, M+1)}{M+1}\,\frac{m+1}{M-(m-1)} - \frac{m-1}{M-(m-1)}\,\frac{n(m, M)}{M}, \tag{7}$$

where $m \in \{0, 1, \ldots, M\}$. Also, $n(m, M) = 0$ if $m > M$. These results allow us to compute $n(m, M)$ for any given values of $m$ and $M$. Using Eqs.(3)-(7), we can re-write $n(m, M)$ in terms of $N$ and $M$ as follows:

$$\begin{aligned}
n(0, M) &= N_t - N(M), \\
n(1, M) &= M\,\nabla N(M), \\
n(2, M) &= -\frac{M(M-1)}{2}\,\nabla^2 N(M), \\
&\;\;\vdots \\
n(m, M) &= (-1)^{m+1}\,\frac{M!}{m!\,(M-m)!}\,\nabla^m N(M).
\end{aligned} \tag{8}$$

$$[\ldots]\;\frac{m-1}{m+1}.$$

[...] 1, we obtain

$$n(m+1, M) \approx \frac{n(1, M)}{(m+1)\,m}. \tag{10}$$

Eq.(5) and Eq.(10) can be used to estimate the probability $p_m$ that a randomly chosen gene from $\{1, \ldots, N_t\}$ has exactly $m$ transcripts in a given library, i.e. $p_m \approx n(m, M)/N$. Then, for large $M$ and $m > 1$ we have

$$p_{m+1} \approx \frac{p_1}{(m+1)\,m}. \tag{11}$$

The probability function $p_m$ has a skewed form and is approximated by the power law form ($p_m \sim m^{-2}$; Lotka-Zipf law, http://linkage.rockefeller.edu/wli/zipf), which describes many other large-scale, complex phenomena such as income, word occurrence in a text, numbers of citations to journal articles, etc.

When $M$ is large enough, we can approximate Eq.(8) with its continuous analog and obtain the probability function $p_m$ in terms of $M$ and $N$ as follows:

$$p_m \approx h(m) := (-1)^{m+1}\,\frac{1}{N}\,\frac{M!}{m!\,(M-m)!}\,\frac{d^m N}{dM^m}, \tag{12}$$

where $m = 1, 2, \ldots$. The function $h(m)$, with the parameter $M$ and with $N$ taken as a function of $M$, will be called the binomial differential (BD) probability function. Taking $m = 1$ in this function, we obtain a differential equation:

$$\frac{dN}{dM} = p_1\,\frac{N}{M} \tag{13}$$

with $N(1) = 1$. We call the function $N$ defined by Eq.(13) the population "logarithmic growth" (LG) model. Note that Eq.(13) could be re-written in the following explicit form:

$$\frac{dN}{dM} = \sum_{j=1}^{N_t} q_j\,(1 - q_j)^{M-1}, \tag{14}$$

where the right side is a sum of geometric distribution probabilities of an initial success in a sequence of $M$ trials. However, the values of $q_j$ and $N_t$ are unknown. Using Eqs.(13)-(14), we can show that $p_1$ is a monotonically decreasing function of $M$. We will use the empirical approximation

$$p_1 = \frac{1 + (1/d)^c}{1 + (M/d)^c}, \tag{15}$$

where $c$ and $d$ are positive constants (see Figure 6A). This function was selected among many possible forms for $p_1(M)$ by fitting the LG model to data points obtained from many yeast and human SAGE libraries and sub-libraries (not presented). Using an explicit specification of $p_1$ allows us to fit the BD model to empirical histograms. The parameter $1/c$ roughly characterizes the rate of accumulation of genes (or gene tags), and the parameter $d$ roughly estimates the maximum number of genes (or gene tags), $N_t$. With the above empirical choice for $p_1$, Eq.(13) has an exact solution:

$$N(M) = \left( M^c\,\frac{1 + 1/d^c}{1 + (M/d)^c} \right)^{\frac{1 + 1/d^c}{c}} \tag{16}$$

with

$$\lim_{M \to \infty} N(M) = N_t = \left(1 + d^c\right)^{\frac{1 + 1/d^c}{c}}. \tag{17}$$

We now have an explicit, although complicated, expression for the BD probability function:

$$h(m) = (-1)^{m+1}\,\frac{1}{N}\,\frac{M!}{m!\,(M-m)!}\,\frac{d^m}{dM^m} \left( M^c\,\frac{1 + 1/d^c}{1 + (M/d)^c} \right)^{\frac{1 + 1/d^c}{c}}. \tag{18}$$

Thus, unlike the fixed GDP models, the BD probability function depends on the number of distinct genes, $N$, and the library size, $M$; it also yields the finite value $N_t$ for the total number of genes as $M \to \infty$. Eqs.(13)-(18) were used to exclude intrinsic experimental errors and redundancies present in yeast SAGE libraries.
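As a rough check on the LG model, the closed-form solution can be compared with a direct numerical integration of the growth equation dN/dM = p1 N/M with N(1) = 1. The sketch below is only illustrative: it assumes the empirical form of p1(M) with constants c and d as described above, rewrites the equation as d(ln N)/d(ln M) = p1(M), and integrates it with Simpson's rule. The function names and the step count are choices made for this example, not part of the original method.

```python
import math


def p1(M, c, d):
    """Empirical single-copy fraction, Eq.(15)."""
    return (1.0 + d ** (-c)) / (1.0 + (M / d) ** c)


def lg_closed_form(M, c, d):
    """Closed-form LG solution, Eq.(16)."""
    a = 1.0 + d ** (-c)
    return (M ** c * a / (1.0 + (M / d) ** c)) ** (a / c)


def lg_limit(c, d):
    """Limit of N(M) as M grows without bound, Eq.(17)."""
    return (1.0 + d ** c) ** ((1.0 + d ** (-c)) / c)


def lg_numeric(M_end, c, d, steps=20_000):
    """Integrate d(ln N)/d(ln M) = p1(M) from M = 1, where N(1) = 1."""
    x, log_n = 0.0, 0.0  # x = ln M, log_n = ln N
    h = math.log(M_end) / steps

    def f(x_val):
        return p1(math.exp(x_val), c, d)

    for _ in range(steps):
        # The right-hand side depends on x only, so each step is Simpson's rule.
        log_n += h / 6.0 * (f(x) + 4.0 * f(x + h / 2.0) + f(x + h))
        x += h
    return math.exp(log_n)


if __name__ == "__main__":
    c, d = 0.579, 6580.0  # rounded ORF parameters quoted in the text
    print(lg_numeric(15_000, c, d), lg_closed_form(15_000, c, d))
    print(lg_limit(c, d))
```

The numerical and closed-form values should agree to many digits, and with the rounded ORF parameters the limit should come out near the ~7,000 genes/ORFs estimate reported in the text.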
Figure Legends

Figure 1: Fitting the empirical relative frequency distributions of the gene expression levels. Log-log plots. (A) ○: frequency of expression levels for the log-phase yeast cell growth library of size 20,096 SAGE tags; solid step line: best-fit Generalized Discrete Pareto (GDP) model with parameters k = 0.974 ± 0.004, b = -0.173 ± 0.004; dotted straight line links (for guidance) the best-fit values for m = 1, 2, ... at k = 1.03 ± 0.005, b = 0. (B) ○: frequency of expression levels for mouse mammary cell cDNA library 341 of size 36,675 ESTs; solid step line: best-fit GDP model for ○ data with parameters k = 1.44 ± 0.006, b = 1.34 ± 0.002. (C) Log-log plot. 1, ○: frequency of expression levels for human normal brain cell library 154 of size 81,516 tags; solid step line: best-fit GDP model for ○ data; 2, empty square and black square: the average frequency of expression levels for 10 sub-libraries of size 6,313 tags taken at random without replacement from library 154 and represented on average by 3,497 distinct tags; dashed line: best-fit GDP model for these data with parameters k = 1.62 ± 0.07, b = 0.01 ± 0.004; black square indicates a frequency value which is significantly different (at p < 0.05) from the corresponding frequency value in human prostate cancer library 2892a with size 6,313 tags represented by 3,531 distinct tags (library 2892a data are not presented). (D) ○: frequency of expression levels for human choriocarcinoma cell cDNA library 2427 of size 10,087 ESTs; solid step line: best-fit GDP model for ○ data with parameters k = 1.88 ± 0.044, b = 1.34 ± 0.005. MV is a missing value.

Figure 2: Size-dependence and equivalence of frequency distributions of the transcript copy numbers for human prostate carcinoma cell line LNCaP SAGE library. (A) Data for two times during the sequencing effort: library 2892a with 6,313 tags (● data; solid line: best-fit GDP model) and library 2892b with 22,637 tags (○ data; dotted line: best-fit GDP model). (B) Equivalence of frequency distributions for library 2892a (6,313 tags; ● data; solid line: best-fit GDP model) and for the same-sized sub-library (6,313 tags) randomly sampled from library 2892b (22,637 tags; ○ data).

Figure 3: Different sources of erroneous SAGE tags and ambiguities of tag-to-genes and tags-to-gene matching.

Figure 4: Analysis of frequency distributions for the yeast SAGE database. (A) "True" tags (○), "inside" erroneous tags (●), and antisense tags (△). (B) Histograms of the proportions of distinct tags represented by one, two, etc. genome sites for populations of "true" tags (○), "inside" erroneous tags (●), and antisense tags (△), respectively.

Figure 5: Decomposition of the frequency distribution of gene expression levels for the SAGE yeast cell library. (A) Log-log plot. ○: the numbers of 5,303 distinct tags represented by 19,527 tags in the G2/M phase-arrested cell library; dashed line: best-fit GDP model (with b = -0.195 ± 0.005, k = 0.96 ± 0.006) for ○ data; ●: the numbers of "true" tags of the same library after removing erroneous tags; dotted line: best-fit GDP model (with b = 0.207 ± 0.013, k = 0.991 ± 0.011) for ● data. (B) Frequency distribution of "true" tags and best-fit GDP model. ●, dotted line: "true" tags; ○, solid line: "outside" erroneous tags; +, broken line: "inside" erroneous tags. Fitted probability function values (computed at m = 1, 2, ...) are linked by lines to guide the eye.

Figure 6: Population growth curves and estimates of the GELPF for a single yeast cell. (A) Growth curves. +: the number of "true" distinct tags of sub-libraries from the pooled yeast library of 49,073 "true" tags; dashed line: best-fit LG model (with d = 20,000 ± 1,946, c = 0.356 ± 0.02) for + data; ○: the number of distinct ORFs found in these sub-libraries; best-fit LG model (with d = 6,575 ± 185, c = 0.579 ± 0.01) for the corresponding ORF data; short-dashed line: the number of redundant "true" tags. (B) Log-log plot. Solid step line: the fraction of ORFs by the BD model for a single yeast cell; ●: a histogram generated from the fitted GDP model for 3,009 ORFs in a single yeast cell, estimated by SAGE data (Velculescu et al, 1997); ○: a relative frequency of 3,009 ORFs in a single log-phase yeast cell, estimated by GeneChip data (Jelinsky et al, 2000). Dashed line links the fitted GDP model data points for m = 1, 2, ... at k = 0.86 ± 0.01, b = 0.37 ± 0.003.
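The GeneChip points in Figure 6B rest on the intensity-to-copy-number conversion m = (I - 20)/165 described in the text, followed by binning into unit intervals centered at 1, 2, ..., 143. The following is a minimal sketch of that procedure under two assumptions that go beyond the text: each gene is assigned to the bin [j - 0.5, j + 0.5) nearest to its estimated m, and genes falling below the first bin are simply left out. The intensity values in the usage example are hypothetical.

```python
import math
from collections import Counter


def intensity_to_copies(intensity):
    """Empirical conversion quoted in the text: m = (I - 20) / 165."""
    return (intensity - 20.0) / 165.0


def expression_histogram(intensities, max_level=143):
    """Count genes per unit bin centered at 1..max_level (assumed binning)."""
    counts = Counter()
    for intensity in intensities:
        m = intensity_to_copies(intensity)
        level = math.floor(m + 0.5)  # bin [level - 0.5, level + 0.5)
        if 1 <= level <= max_level:
            counts[level] += 1
    return counts


if __name__ == "__main__":
    hypothetical_intensities = [185, 190, 350, 40, 1670]  # illustrative only
    hist = expression_histogram(hypothetical_intensities)
    total = sum(hist.values())
    for level in sorted(hist):
        print(level, hist[level] / total)  # relative frequency per level
```

Dividing each count by the number of binned genes gives relative frequencies of the kind plotted in Figure 6B.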
Sample             M       N       M/N    p1     J      J/M     k ± SE          b ± SE            MSC
Lib. 2427          10087   3586    2.81   0.54   246    0.029   1.88 ± 0.04     1.34 ± 0.05       7.1
Lib. 2892a         6313    3531    1.79   0.81   78     0.012   1.05 ± 0.015    -0.48 ± 0.008     10
Lib. 166           14616   5383    2.72   0.70   462    0.032   1.28 ± 0.01     0.015 ± 0.02      10
Lib. 2892b         22637   9348    2.42   0.74   221    0.010   1.08 ± 0.03     -0.28 ± 0.01      11
Lib. 161           49334   15182   3.25   0.59   832    0.017   1.44 ± 0.01     0.57 ± 0.007      8.3
Lib. 154           81516   19137   4.26   0.53   1598   0.020   1.25 ± 0.012    0.57 ± 0.016      7.1
Lib. 341           36675   8019    4.57   0.42   1641   0.045   1.44 ± 0.01     0.90 ± 0.06       5.0
Yeast, G2/M        19527   5303    3.68   0.67   519    0.027   0.96 ± 0.006    -0.195 ± 0.006    8.8
Yeast, S-phase     19871   5785    3.44   0.67   561    0.028   0.98 ± 0.004    -0.197 ± 0.004    9.8
Yeast, log-phase   20096   5324    3.78   0.66   636    0.032   0.97 ± 0.004    -0.173 ± 0.004    9.3
Yeast, total       59494   11329   5.25   0.62   1716   0.029   0.94 ± 0.008    -0.108 ± 0.008    7.7

TABLE 1: Fitting of the GDP model to the empirical frequency distributions for cDNA and SAGE libraries of human tissues and SAGE libraries of yeast cells. k ± SE and b ± SE are the estimated parameters. MSC (model selection criterion) is the goodness-of-fit criterion. p1 is the fraction of distinct tags represented by one copy. Unilib identifiers: 2427 (choriocarcinoma, cDNA library (Life Technology method)), 2892a (LNCaP, prostate cancer cell line, SAGE library), 166 (normal colon, SAGE), 2892b (the prostate cancer cell line library 2892a after one year of upgrading), 161 (pooled normal brain tissues, SAGE), and 154 (normal brain cells, >95% white matter, SAGE). Library 341 is a mouse mammary cell cDNA library (Life Technology method). Three yeast SAGE libraries (Velculescu et al, 1997) and a pool of these libraries are also presented.
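The fitted parameters in Table 1 can be turned back into a probability function for a quick plausibility check against the p1 column. The sketch below assumes that the GDP model has the form f(m) = (m + b)^-(k+1) / z for m = 1, ..., J, with z the normalizing sum; Eq.(1) is not reproduced in this section, so this form, the reading of J as the largest expression level, and the helper names are all assumptions made for illustration. The sampling helper mirrors, in spirit only, the Monte Carlo sampling from a fitted GDP distribution mentioned in the text.

```python
import random


def gdp_pmf(k, b, J):
    """Return [f(1), ..., f(J)] for the assumed GDP form, requiring b > -1."""
    weights = [(m + b) ** (-(k + 1.0)) for m in range(1, J + 1)]
    z = sum(weights)  # normalizing constant over the truncated support
    return [w / z for w in weights]


def sample_levels(pmf, n, seed=0):
    """Draw n expression levels from the pmf (levels are 1-based)."""
    rng = random.Random(seed)
    return rng.choices(range(1, len(pmf) + 1), weights=pmf, k=n)


if __name__ == "__main__":
    # Yeast log-phase library row of Table 1: k = 0.974, b = -0.173, J = 636.
    pmf = gdp_pmf(0.974, -0.173, 636)
    print(f"f(1) = {pmf[0]:.3f}")
    draws = sample_levels(pmf, 3009)
    print(sum(1 for m in draws if m == 1) / len(draws))
```

Under these assumptions, f(1) for the yeast log-phase and 2892a rows should fall close to the tabulated p1 values, which gives some support to the assumed functional form.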
Set              M       %       N       %       Genes/ORFs   %
Pooled Tags      59494   100%    11329   100%    4735         100%
Outside Errors   3821    6.4%    2849    25.1%   0            0%
Inside Errors    8280    13.9%   2661    23.5%   1961         41.4%
True Tags        47393   79.7%   5819    51.4%   4244         89.6%
Antisense Tags   4235    7.1%    1689    14.9%   1504         31.8%

TABLE 2: Characteristics of the different categories of SAGE tags taken for analysis: the numbers of tags (M), the numbers of distinct tags (N), and the numbers of yeast genes/ORFs presented in the Tag Location database. The 3 libraries for the log-phase, G2/M-phase and S-phase of the cell cycle were pooled.
Sample           M       N      M/N    p1      J     k ± SE         b ± SE            MSC
G2/M             19527   5303   3.68   0.67    519   0.96 ± 0.01    -0.195 ± 0.006    8.8
Outside errors   1447    1234   1.17   0.91    15    2.74 ± 0.02    0.0               10.2
Inside errors    1792    869    2.06   0.76    182   1.59 ± 0.01    0.0               8.5
True Tags        16288   3200   5.09   0.55    519   0.99 ± 0.01    0.21 ± 0.01       7.3
Genes or ORFs    15000   3009   4.99   0.382   288   1.56 ± 0.04    2.17 ± 0.08       7.7

TABLE 3: Decomposition of the Pareto-like distribution of G2/M-phase arrested yeast cells in the SAGE library. Characteristics of the distributions of the different classes of SAGE tags are as follows: the erroneous tags that fail to match the entire yeast genome sequence ("outside" errors, mostly associated with sequencing errors), the erroneous tags that fail to match known ORFs/coding regions or to map within the 500 bp adjacent downstream genomic regions ("inside" errors), and the "true" tags, which contain a fraction of ambiguously matching tags.
Figure 1: Plots of frequency, f (of distinct tags or UniGenes), versus expression level, m (SAGE tags or ESTs). Panels: (A) Yeast cells; (B) Mouse mammary tissue; (C) Human Brain Tissue (curves 1 and 2); (D) Human Choriocarcinoma.
Figure 2: Panels (A) and (B). Axes: f versus Expression Level (tags).
Figure 3: Schematic of SAGE tags t1-t8 positioned along Gene 1 to Gene 5 (5' to 3'). t1 = "outside" erroneous tag; t2 = "inside" erroneous tag; t3, ..., t8 = "true" tags.
Figure 4: Panels (A) and (B), each showing True Tags, Inside Err., and Antisense Tags. Axes: (A) ORFs, % versus SAGE Tags; (B) SAGE Tags, % versus Number of Genome Sites.
Figure 5: Panels (A) and (B). Axes: (A) Number of distinct tags versus Expression level, m (tags); (B) Relative Frequency versus Expression level, m (tags).
Figure 6: Panels (A) and (B). Axes: (A) Number of distinct tags or ORFs versus Library size, M (tags); (B) Fraction of Genes versus Expression level (molecules/cell).