Neuroinformatics © Copyright 2006 by Humana Press Inc. All rights of any nature whatsoever are reserved. 1539-2791/06/95–118/$30.00 (Online) 1559-0089 DOI: 10.1385/NI:4:1:95

Original Article

Bayesian Variable Selection for Gene Expression Modeling With Regulatory Motif Binding Sites in Neuroinflammatory Events

Kuang-Yu Liu,* Xiaobo Zhou, Kinhong Kan, and Stephen T. C. Wong

Harvard Center for Neurodegeneration and Repair, Center for Bioinformatics, Harvard Medical School, 220 Longwood Avenue, Boston, MA 02115; and Department of Radiology, Brigham and Women’s Hospital, Harvard University Medical School, 75 Francis Street, Boston, MA 02115

Abstract

Multiple transcription factors (TFs) coordinately control transcriptional regulation of genes in eukaryotes. Although numerous computational methods focus on the identification of individual TF-binding sites (TFBSs), very few consider the interdependence among these sites. In this article, we studied the relationship between TFBSs and microarray gene expression levels using both family-wise and member-specific motifs, under various combinations of regression models with Bayesian variable selection, as well as motif scoring and sharing conditions, in order to account for the coordination complexity of transcription regulation. We proposed a three-step approach to model the relationship. In the first step, we preprocessed microarray data and used p-values and expression ratios to preselect upregulated and downregulated genes. The second step aimed to identify and score individual TFBSs within the DNA sequence of each gene. A method based on the degree of similarity and the number of TFBSs was employed to calculate the score of each TFBS in each gene sequence. In the last step, linear regression and probit regression were used to build a predictive model of gene expression outcomes using these TFBSs as predictors. Given a certain number of predictors to be used, a full search of all possible predictor sets is usually combinatorially prohibitive. Therefore, this article considered the Bayesian variable selection for prediction using either of the regression models. The Bayesian variable selection has been applied in the context of gene selection, missing value estimation, and regulatory motif identification. In our modeling, the regressor was approximated as a linear combination of the TFBSs and a Gibbs sampler was employed to find the strongest TFBSs. We applied these regression models with the Bayesian variable selection to a spinal cord injury gene expression data set. These TFs demonstrated intricate regulatory roles either as a family or as individual members in neuroinflammatory events. Our analysis can be applied to create plausible hypotheses for combinatorial regulation by TFBSs while avoiding false-positive candidates in the modeling process. Such a systematic approach makes it possible to dissect transcription regulation from a more comprehensive perspective, through which phenotypical events at cellular and tissue levels are moved forward by molecular events at gene transcription and translation levels.

Index Entries: Gene expression; Bayesian variable selection; DNA regulatory motifs; TFBSs; neuroinflammation.

(Neuroinformatics DOI: 10.1385/NI:4:1:95)

*Author to whom all correspondence and reprint requests should be addressed. E-mail: [email protected]

Introduction

The complex functions of a living cell are carried out through the concerted activity of many genes and gene products. A central goal of molecular biology is to discover the regulatory mechanisms governing the expression of genes in the cell (i.e., how the information coded at the genotypical level manifests at the phenotypical level). The expression of a gene is controlled by many mechanisms. A key factor in these mechanisms is transcription regulation by various proteins, known as transcription factors (TFs), which bind to specific sites in the promoter region of a gene and activate or repress transcription. Genome sequences specify the gene transcription and translation activities that produce RNAs and proteins to support living cells, but how cells control global gene expression is far from transparent. The transcriptional regulatory apparatus is organized in the form of arrays of TF-binding sites (TFBSs) or motifs on DNA. Furthermore, transcriptional regulation in eukaryotic organisms requires cooperation of multiple TFs. To date, most computational methods focus on identifying the components of this array, often a few TFBSs at a time, rather than exploring their interdependence in transcription regulation. Recently, Conlon et al. (2003) developed a method called “motif regressor.” This novel approach combines binding site identification using position weight matrices, in particular using MDSCAN of Liu et al. (2002), and the linear regression approach for motif finding (Bussemaker et al., 2001; Keles et al., 2002; Tadesse et al., 2004). All of those approaches first identify potential regulatory elements from groups of genes separately with various methods and then use a linear regression model to predict de novo regulatory motifs (Keles et al., 2004). Their goal was de novo motif prediction rather than investigation of combinatorial regulation by multiple TFBSs. Therefore, modeling the relationships between binding sites and transcription regulation remains one of the challenging problems of contemporary biology.

Our objective in this work is to model the relationship between TFBSs and gene expression levels. Given the complexity of TFBSs and transcription regulation, different approaches were needed for the binding site detection step and for the regression models to identify the most important regulatory motifs. Among the motifs discovered in those references, often only very few are experimentally verified; we do not know whether the other motifs truly represent TFBSs or not. The recent assessment by Tompa et al. (2005) also indicated the difficulty of regulatory element prediction with purely computational approaches in metazoans. Hence, in this study, we used the TF library TRANSFAC Professional (BIOBASE GmbH, Wolfenbüttel, Germany) (Matys et al., 2003), which contains experimentally verified binding site information, to identify highly likely TFBS candidates. We then studied the relationship between gene expression and TFBS candidates, without introducing too many false-positive candidates into the modeling process.

Neuroinformatics_________________________________________________________________ Volume 4, 2006


Our approach included three major steps. Because the actual microarray data were quite noisy and many expression values were often lost owing to defects, contaminations, and various problems (Zhou et al., 2003b), we first preprocessed microarray data using missing value estimation; then we used p-values and expression ratios to preselect upregulated and downregulated genes. The second step aimed to identify and score individual TFBSs within DNA sequence of each gene. A method based on the degree of similarity and the number of TFBSs was employed to calculate the score of each TFBS in each gene sequence. In the last step, linear regression and probit regression were used to build a predictive model of gene expression outcomes using these TFBSs as predictors. Given a large number of predictors to be used, a full search of all possible predictor sets is usually combinatorially prohibitive. To this end, this article considered the Bayesian variable selection for prediction using either of the regression models. The Bayesian variable selection has been applied in the context of gene selection (Lee et al., 2003), missing value estimation (Zhou et al., 2003b), and regulatory motif identification (Tadesse et al., 2004). In our modeling, the regressor was approximated as a linear combination of the TFBSs and a Gibbs sampler was employed to find the TFBSs with the strongest predictive power. The contribution of our work was twofold. First, the Bayesian variable selection has been applied with two regression models on spinal cord injury (SCI) data to study the relationship between gene expression levels and TFBSs in neuroinflammatory events. Second, various combinations of TFBSs have been taken into account explicitly through the Bayesian variable selection in the modeling process to study how they affected the relationship between TFBSs and gene expression. 
We considered a more comprehensive modeling approach in our study of neuroinflammatory events that included regression models with the Bayesian variable selection for multiple subsets of expression data under various conditions of motif scoring and sharing. These steps were necessary in order to account for the coordination complexity of transcription regulation. Especially notable in our results were the nuclear factor of κ light polypeptide gene enhancer in B-cells 1 (NF-κB) and paired box (PAX) families of TFs. These TFs demonstrated intricate regulatory roles either as a family or as individual members during transcription activation and repression of inflammatory events and immune response. Neuroinflammatory events have been shown to play an important role in many neurological diseases. Our analysis with the Bayesian variable selection created plausible hypotheses for combinatorial interaction among TFBSs that were based on experimentally verified binding site sequence information. Hence our approach also offers the additional advantage of avoiding false-positive candidates in the modeling process.

The article is organized as follows. In the Methods, we describe the neuroinflammatory microarray data set, the preprocessing and preselection of gene expression data, the preparation of genomic sequence, the search of TFBSs, the conditions of motif scoring and sharing, the regression model formulation, and finally the parameter estimation using the Bayesian variable selection. The “Experimental Results” section provides modeling results and experimental analysis of various combinations of regression models as well as motif scoring and sharing conditions. The discussion describes the biological relevance of our modeling approach and its results. The conclusion contains the concluding remarks. The appendix gives further details on the distributional equations, Gibbs sampling assumptions, and derivation of the Bayesian variable selection.

Methods

We have applied the Bayesian variable selection approach with two regression models to study the relationship between gene expression levels and TFBSs on the neuroinflammatory microarray data set of “antiinflammatory compound screening in SCI” (Pan et al., 2004). Inflammatory events and immune response have been shown to play an essential role in many neurological diseases. In particular, the inflammatory response often exacerbates secondary tissue damage following SCI (Popovich and Jones, 2003). As a result, countertreatments that reduce inflammation and spare secondary damage have been actively investigated. The microarray experiment aimed to better understand subsequent physiological events in secondary injury and to facilitate better treatments in the future. In particular, Pan et al. (2004) had shown that a consistent and unique gene expression profile was associated with NS398, a highly selective cyclooxygenase-2 (COX-2) inhibitor. They also suggested that the overall effect of these upregulated genes could be interpreted as neuroprotective. Hence our modeling of SCI data concentrated on the effect of NS398 on gene expression compared with vehicle (injury without treatment).

Preparation of Gene Expression Data and Sequence Data

Gene Expression Data

The antiinflammatory compound screening data set (Pan et al., 2004) is described briefly as follows. Cultured rat (Rattus norvegicus) spinal cord slices from acute contusion were used as a tissue model of SCI. In the microarray experiment, groups of cultures were treated with each of the following five antiinflammatory compounds: (1) acetaminophen, (2) indomethacin, (3) glucocorticoid methylprednisolone, (4) NS398, and (5) a mixture of rat recombinant interleukin (IL)-1 receptor antagonist and soluble tumor necrosis factor receptor: Fc chimeric protein. Each group consisted of three separate cultures as independent replicates. The two control groups included uninjured tissue and vehicle-treated tissue cultures (injury without treatment). Further details of the data set are available from NCBI Gene Expression Omnibus (GEO) Accession GDS419, Series GSE633 and the cited reference. The microarray platform for the screening experiment was the custom two-channel NGEL 2.12 rat oligo-microarray (GEO Platform GPL479). There are 4967 probes on the array, specific for 4803 rat cDNA clusters. Details of the oligonucleotide probes are available from http://base.rutgers.edu/. No normalization procedure was applied to the ratios of the two-channel fluorescence intensity microarray data.

Expression Data Preprocessing and Preselection

Missing Value Estimate

As a first step of preprocessing microarray data, we performed missing value estimation for all genes in the expression data. In short, a weighted K-nearest-neighbor method selects genes with expression profiles similar to the gene of interest to impute missing values. If the expression value of gene g is missing in one experiment, this method finds K = 10 other genes whose values are present in that experiment and whose expression is most similar to that of gene g in the remaining experiments. A linear combination of the K closest genes, weighted by their expression similarity, is then used to predict the missing value (Troyanskaya et al., 2001; Zhou et al., 2003a).
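The imputation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knn_impute`, the Euclidean distance, and the inverse-distance weights are hypothetical choices, since the text only says that neighbors are weighted by expression similarity.

```python
import numpy as np

def knn_impute(expr, k=10):
    """Fill NaNs in a genes x experiments matrix by weighted K-nearest-neighbor averaging."""
    expr = np.asarray(expr, dtype=float)
    filled = expr.copy()
    for g, e in zip(*np.where(np.isnan(expr))):
        observed = ~np.isnan(expr[g])              # experiments where gene g is present
        # Candidate neighbors: genes present in experiment e and in every
        # experiment where gene g itself is observed.
        ok = ~np.isnan(expr[:, e]) & ~np.isnan(expr[:, observed]).any(axis=1)
        ok[g] = False
        cand = np.flatnonzero(ok)
        if cand.size == 0:
            continue                               # nothing to borrow from; leave NaN
        # Assumption: similarity is measured by Euclidean distance.
        dist = np.linalg.norm(expr[cand][:, observed] - expr[g, observed], axis=1)
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-12)    # closer genes get larger weights
        filled[g, e] = np.average(expr[cand[nearest], e], weights=weights)
    return filled
```

With K = 10, as in the text, `knn_impute(expr)` would be called once on the full expression matrix before any gene preselection.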

Genes with Maximum Fold Changes

The list of genes was screened further for the ones with p = 0 and maximum fold changes. Similar methods to preselect genes during the preprocessing step have been used by others (Conlon et al., 2003). We first calculated the two-sided p-value of each gene from the ratio of the difference in the means to the standard error of the difference in the means, at (n1 – 1) + (n2 – 1) degrees of freedom (Student’s t-test). Only those genes whose p-values were equal to zero were retained for the further selection process. Next we created two maximum concordant sets of genes from the top 30 upregulated and top 30 downregulated genes ranked by fold change. A combined maximum discordant set of genes was made of the top 15 genes from each of the maximum concordant sets. As a result, there were three lists of 30 genes for the microarray experiment data set: upregulated concordant genes (Concordant+), downregulated concordant genes (Concordant–), and combined discordant genes (Discordant).
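The construction of the three gene lists can be sketched as below. This is a hedged illustration: the function name `preselect` and its parameters are hypothetical, p-values are assumed to be precomputed, and fold change is assumed to be the NS398-to-vehicle ratio (above 1 for upregulated genes, below 1 for downregulated genes).

```python
import numpy as np

def preselect(genes, fold_change, p_values, top=30, half=15, p_cutoff=0.0):
    """Return (Concordant+, Concordant-, Discordant) gene lists ranked by fold change."""
    genes = np.asarray(genes)
    fc = np.asarray(fold_change, dtype=float)
    keep = np.asarray(p_values, dtype=float) <= p_cutoff    # retain genes reported with p = 0
    genes, fc = genes[keep], fc[keep]
    order = np.argsort(fc, kind="stable")                   # ascending fold change
    down = genes[order[:top]]                               # most downregulated first
    up = genes[order[::-1][:top]]                           # most upregulated first
    discordant = np.concatenate([up[:half], down[:half]])   # top genes from each list
    return up, down, discordant
```

The defaults mirror the numbers in the text (30 genes per concordant list, 15 from each for the discordant list); smaller values are convenient for checking the logic on toy data.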

Sequence Preparation

Three sequence databases for Concordant+, Concordant–, and Discordant were generated. Chromosomal locations and sequences of these genes were collected from the Ensembl project (Ensembl, 2004), based on the Rat Genome Sequencing Consortium (RGSC, 2004) R. norvegicus genome assembly Build 3.1 (RGSC3.1). Genome sequences of up to 800 bases upstream from the 5′ end of the sense strand (+) were retrieved from Ensembl for each gene in the three lists. As reported by van Helden et al. (1998), the majority of regulatory sites in eukaryotes are located within 800 base pairs (bp) upstream of the transcription start sites. Hence, a similar choice of range has been made by others in search of regulatory motifs (Keles et al., 2002, 2004; Conlon et al., 2003; Tadesse et al., 2004). Biological experiments searching for TFBSs in the proximal promoter region of genes in higher eukaryotes also suggest that 800 bp is often a reasonable choice (Azizkhan et al., 1993; Hagen et al., 1994; Birnbaum et al., 1995; Bigger et al., 1997; Majello et al., 1997; Noti, 1997; Rajakumar et al., 1998; Geiger et al., 2001; Ross et al., 2002; Yoo et al., 2002; Rajakumar et al., 2004).

Searching and Scoring of TFBSs

TFBSs’ Search

To search for potential TFBSs in the sequence database, MATCH (Kel et al., 2003), with vertebrate position-specific weight matrices (PSWMs) and minimal false-positive cutoff values, and PATCH, with vertebrate sequence patterns, from TRANSFAC Professional (BIOBASE GmbH, Wolfenbüttel, Germany) (Matys et al., 2003) were used to avoid false-positives among potential regulatory binding sites. TRANSFAC Professional utilizes the core similarity score (CSS), the matrix similarity score (MSS), an information vector, two separate cutoffs for CSS and MSS (Kel et al., 2003), and reference patterns in its binding site search. CSS and MSS measure the quality of a match between the sequence and the PSWM of a binding site. CSS uses the five most conserved consecutive positions within the matrix, whereas MSS uses all positions. The information vector takes into account whether mismatches occur in less conserved regions within the matrix or in highly conserved regions. Cutoff values of CSS and MSS with minimal false-positives admit only the most promising potential binding sites while minimizing the number of false-positives. Owing to the improvements from CSS, MSS, the information vector, the two separate cutoffs of CSS and MSS, and the reference patterns in its binding site search, TRANSFAC Professional was expected to detect TFBSs better than standard tools (Chen et al., 1995; Quandt et al., 1995; Prestridge, 1996) that use only the International Union of Pure and Applied Chemistry (IUPAC) consensus sequence or the PSWM information of a canonical binding site (Kel et al., 1999). In addition, the PSWMs and consensus sequences used in TRANSFAC Professional were based on experimentally verified putative TFBSs. Using both could complement each other’s strengths and avoid unnecessary information reduction in the stage of motif detection. Individual similarity scores and appropriate cutoffs are discussed further in the cited references (Kel et al., 1999; Matys et al., 2003). Duplicated binding sites of the same TFs at the same location along the sequence were removed from the search results.
There are additional motif detection approaches, such as context-based search (Grabe, 2002) and hidden Markov models (Grundy et al., 1997), that may be used for motif detection in the second major step. TRANSFAC was used in our study to take advantage of its strengths: it utilizes experimentally verified putative TFBS information, uses both consensus sequences and PSWMs, distinguishes between highly conserved and less conserved regions within the binding site motif, and minimizes the number of false-positive binding site candidates. Novel regulatory motif discovery approaches (Bailey and Elkan, 1994; Grundy et al., 1997; Liu et al., 2002; Bailey and Noble, 2003; Conlon et al., 2003; Kechris et al., 2004; Tadesse et al., 2004) can also be considered in its place, and pooling the obtained binding sites will offer an even richer class of potential binding sites for our modeling (Keles et al., 2004).

Motif Score

In this study we considered two motif score definitions inspired by Conlon et al. (2003) and Keles et al. (2004). In the first definition, $S_{g,h}$ is determined by how well the upstream sequence of a gene g matches a motif h, in terms of both degree of matching and number of sites, by the following function:

\[ S_{g,h} = \sum_{x \in X_g} \Theta_h(x) \tag{1} \]

in which x is the location, $X_g$ represents all possible locations along the sequence on both strands of gene g, and the similarity score $\Theta_h(x)$ is computed as

\[ \Theta_h(x) = \begin{cases} 1 & \text{if } \theta_h(x) \ge \theta_{h,\mathrm{cutoff}} \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

in which $\theta_h$ and $\theta_{h,\mathrm{cutoff}}$ are the individual similarity score and the appropriate cutoff for motif h introduced in the subsection “TFBSs’ Search.” In the second definition, a dichotomized version of the motif score defined in Eq. 1 is computed as

\[ S_{g,h} = \begin{cases} 1 & \text{if gene } g \text{ has at least one copy of motif } h \\ 0 & \text{otherwise} \end{cases} \tag{3} \]
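The two score definitions in Eqs. 1 to 3 can be illustrated with a short sketch. The input layout (`hits[gene][motif]` as a list of similarity scores, one per candidate location), the function name, and the example numbers are hypothetical; in the study the scores and cutoffs come from MATCH and PATCH.

```python
import numpy as np

def motif_score_matrix(hits, genes, motifs, cutoffs, dichotomize=False):
    """Build the genes x motifs score matrix from per-location similarity scores."""
    S = np.zeros((len(genes), len(motifs)))
    for gi, g in enumerate(genes):
        for hi, h in enumerate(motifs):
            scores = np.asarray(hits.get(g, {}).get(h, []), dtype=float)
            # Eq. 2 marks each location that passes the cutoff; Eq. 1 sums the marks.
            S[gi, hi] = np.count_nonzero(scores >= cutoffs[h])
    # Eq. 3: 1 if the gene has at least one copy of the motif, else 0.
    return (S > 0).astype(float) if dichotomize else S
```

For example, a gene with similarity scores 0.95, 0.99, and 0.70 for a motif whose cutoff is 0.9 receives a count score of 2 and a dichotomized score of 1.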

Consideration of Common Motifs Because the number of possible motifs (n) was large, we considered two situations of motif utilization in our modeling process: one was to consider all possible motifs among upstream sequences of all genes g = 1, …, m, i.e., the largest union, and the other one was to consider only a common set of motifs which were shared by at least six out of the total m = 30 genes.
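The second situation, restricting the model to motifs shared by at least six genes, amounts to a column filter on the motif score matrix. The sketch below assumes a motif counts as present in a gene when its score is positive; the function name `common_motifs` is hypothetical.

```python
import numpy as np

def common_motifs(S, min_genes=6):
    """Keep only the motif columns that occur in at least min_genes genes."""
    S = np.asarray(S)
    shared = (S > 0).sum(axis=0) >= min_genes     # number of genes carrying each motif
    return S[:, shared], np.flatnonzero(shared)   # reduced matrix and kept column indices
```

The returned indices make it possible to map selected columns back to motif names after variable selection.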

Problem Statement of Modeling

Let y denote the gene expression profile, in which y could be a continuous, binary, or ternary expression profile. As a continuous vector variable, y often represents the ratio of mRNA abundance under two different conditions, or its log transformation. As a discrete vector variable, y represents the class of genes; for example, in binary format 0 is for downregulated genes and 1 for upregulated genes. Although our modeling approach can be extended to ternary y, in which –1 is for downregulated genes, 0 for invariant genes, and 1 for upregulated genes, we have devoted our study here to binary y. We assume independent and identically distributed (i.i.d.) observations of the random variable y. For any given set of potential binding sites, h = 1, …, n, and a list of genes g = 1, …, m, we define a covariate motif score matrix whose columns correspond to Motif 1, …, Motif n:

\[ X = \begin{bmatrix} S_{1,1} & S_{1,2} & S_{1,3} & \cdots & S_{1,n} \\ S_{2,1} & S_{2,2} & S_{2,3} & \cdots & S_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_{m,1} & S_{m,2} & S_{m,3} & \cdots & S_{m,n} \end{bmatrix} \tag{4} \]

in which each entry $S_{g,h}$ is the sequence motif-matching score of motif h for gene g. Next we present two models to study the relationship between gene expression and DNA TFBSs among those m genes: one is the linear regression model, employed to treat the continuous expression profile, and the other is the probit regression model, employed to treat the discrete expression profile.

Linear Regression Model

In this section, we discuss the case in which y is a continuous vector. The following linear regression model is used to relate the expression level of the target gene ($y_g$) and the motif scores of the combination of TFBSs ($X_g$):

\[ y_g = X_g \beta + e_g, \quad g = 1, \ldots, m \tag{5} \]

in which $X_g$ denotes the gth row of matrix X in Eq. 4, $\beta = [\beta_1, \beta_2, \ldots, \beta_n]^T$ is the vector of regression parameters, and the i.i.d. noise $e_g$ follows $e_g \sim N(0, \sigma^2)$.

Bayesian Variable Selection for Linear Regression

Define $\gamma$ as the n × 1 vector of indicator variables $\gamma_j$ such that $\gamma_j = 0$ if $\beta_j = 0$ (the variable is not selected) and $\gamma_j = 1$ if $\beta_j \neq 0$ (the variable is selected). Given $\gamma$, let $\beta_\gamma$ consist of all nonzero elements of $\beta$ and let $X_\gamma$ be the columns of X corresponding to those elements of $\gamma$ that are equal to 1. Further details of the assumptions and the posterior distributions in estimating $\gamma$, $\beta$, and $\sigma^2$ are given in the subsection Bayesian Variable Selection for Linear Regression in the Appendix. The Gibbs sampling algorithm for jointly estimating the parameters $\gamma$, $\beta$, and $\sigma^2$ is as follows:

1. Draw $\gamma$ from $p(\gamma \mid y)$ in Eq. A3. In practice, each $\gamma_j$ is sampled independently from

\[ p(\gamma_j \mid y, \gamma_{h \neq j}) \propto (1 + c)^{-\frac{n_\gamma}{2}} \exp\left[-\frac{1}{2} S(\gamma, y)\right] \pi_j^{\gamma_j} (1 - \pi_j)^{1 - \gamma_j}, \quad j = 1, \ldots, n \tag{6} \]

2. Draw $\sigma^2$ from $p(\sigma^2 \mid y, X_\gamma)$ in Eq. A4.

3. Draw $\beta$ from $p(\beta \mid y, X_\gamma, \sigma^2)$ in Eq. A5.

In this study, the initial parameters were randomly set. T = 10,000 iterations were implemented with the first 2000 as the burn-in period to obtain the Monte Carlo samples $\{\gamma^{(t)}, \sigma^{2(t)}, \beta^{(t)}, t = 1, \ldots, T\}$. We counted the number of times that each motif appeared in $\{\gamma^{(t)}, t = 2001, \ldots, T\}$. The motifs with the highest appearance frequencies played the strongest roles in predicting the target gene. The details of fast implementation can be found in Zhou et al. (2003b).
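The indicator update of step 1 and the frequency count can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' code: S(γ, y) is defined in the Appendix, which is not reproduced here, so the sketch assumes the g-prior form S(γ, y) = yᵀy − c/(1+c) · yᵀX_γ(X_γᵀX_γ)⁻¹X_γᵀy used in related work (Lee et al., 2003); steps 2 and 3 (drawing σ² and β) are omitted; and c = 100 and π_j = 10/n are borrowed from the probit section. The names `log_post` and `gibbs_select` are hypothetical.

```python
import numpy as np

def log_post(gamma, X, y, c, pi):
    """Log of the right-hand side of Eq. 6, up to an additive constant."""
    idx = np.flatnonzero(gamma)
    S = y @ y
    if idx.size:
        Xg = X[:, idx]
        # Assumed form of S(gamma, y); lstsq avoids inverting Xg'Xg explicitly.
        fit = Xg @ np.linalg.lstsq(Xg, y, rcond=None)[0]
        S -= c / (1.0 + c) * (y @ fit)
    prior = np.sum(gamma * np.log(pi) + (1 - gamma) * np.log1p(-pi))
    return -0.5 * idx.size * np.log1p(c) - 0.5 * S + prior

def gibbs_select(X, y, c=100.0, pi=None, n_iter=10000, burn_in=2000, seed=0):
    """Return the post-burn-in selection frequency of each motif (column of X)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pi = np.full(n, 10.0 / n if pi is None else pi)   # prior P(gamma_j = 1)
    gamma = np.zeros(n, dtype=int)
    counts = np.zeros(n)
    for t in range(n_iter):
        for j in range(n):
            lp = np.empty(2)
            for v in (0, 1):                          # evaluate both states of gamma_j
                gamma[j] = v
                lp[v] = log_post(gamma, X, y, c, pi)
            p1 = 1.0 / (1.0 + np.exp(np.clip(lp[0] - lp[1], -700.0, 700.0)))
            gamma[j] = rng.random() < p1
        if t >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)
```

The default prior 10/n is only meaningful when n is larger than 10, as in the study; for small toy problems a value such as `pi=0.2` can be passed instead. Motifs would then be ranked by the returned frequencies.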

Probit Regression Model

For discrete expression profiles, we employ a probit regression model (Zhou et al., 2003b) to study the relationship between the expression profile and TFBSs. Without loss of generality, y represents the class of genes in binary format, i.e., 0 for downregulated genes and 1 for upregulated genes. We assume independent and identically distributed observations of the random variable y. For any given set of potential binding sites, h = 1, …, n, and a list of genes g = 1, …, m, we use the covariate motif score matrix defined in Eq. 4. Here we study the motif selection problem using probit regression with the Bayesian variable selection. In the binomial probit regression, i.e., when the number of discrete states is K = 2, the relationship between the gene expression level $y_g$ and the motif scores of the combination of TFBSs $X_g$ is modeled as a standard binomial probit regressor (Albert and Chib, 1993), which yields

\[ P(y_g = 1 \mid \beta) = \Phi(X_g \beta), \quad g = 1, \ldots, m \tag{7} \]

in which $\beta = [\beta_1, \beta_2, \ldots, \beta_n]^T$ are the regression parameters, and $\Phi$ is the standard normal cumulative distribution function. Introduce m independent latent variables $z_1, \ldots, z_m$, in which $z_g \sim N(X_g \beta, 1)$, i.e.,

\[ z_g = X_g \beta + e_g, \quad g = 1, \ldots, m \tag{8} \]

and $e_g \sim N(0, 1)$. Define $\gamma$ as the n × 1 indicator vector with the jth element $\gamma_j$ such that $\gamma_j = 0$ if $\beta_j = 0$ (the variable is not selected) and $\gamma_j = 1$ if $\beta_j \neq 0$ (the variable is selected). The Bayesian variable selection estimates $\gamma$ from the posterior distribution $p(\gamma \mid z)$ (Lee et al., 2003). When y is in ternary format, in which –1 is for downregulated genes, 0 for invariant genes, and 1 for upregulated genes, the general multinomial probit model is used in place of the standard binomial probit model. For further details of extending the Bayesian variable selection to the multinomial probit model, please see the section “Bayesian Variable Selection for Probit Regression” in the appendix.

Bayesian Variable Selection for Probit Regression

Here we summarize the procedure of the Bayesian variable selection for the general multinomial probit model in which K ≥ 2. Further details of the assumptions and the posterior distributions in estimating $\gamma$, $\{\beta_{k,\gamma}\}$, and $\{z_k\}$ are given in the section “Bayesian Variable Selection for Probit Regression” in the appendix. The Gibbs sampling algorithm for estimating the parameters $\gamma$, $\{\beta_{k,\gamma}\}$, and $\{z_k\}$ is as follows:

• Draw $\gamma$ from $p(\gamma \mid z_1, \ldots, z_{K-1})$. In practice, each $\gamma_j$ is sampled independently from

\[ p(\gamma_j \mid z_1, \ldots, z_{K-1}, \gamma_{h \neq j}) \propto p(z_1, \ldots, z_{K-1} \mid \gamma)\, p(\gamma_j) \propto (1 + c)^{-\frac{(K-1) n_\gamma}{2}} \exp\left\{-\frac{1}{2} \sum_{k=1}^{K-1} S(\gamma, z_k)\right\} \times \pi_j^{\gamma_j} (1 - \pi_j)^{1 - \gamma_j}, \quad j = 1, \ldots, n \tag{9} \]

in which $n_\gamma = \sum_{h=1}^{n} \gamma_h$, c = 100, and $\pi_j = P(\gamma_j = 1)$ is the prior probability of selecting the jth motif. It was set to $\pi_j = 10/n$ because of the very small sample size. When $\pi_j$ took a larger value, we found that $(X_\gamma^T X_\gamma)^{-1}$ often did not exist.

• Draw $\beta_k$ from $p(\beta_k \mid \gamma, z_k) \propto N(V_\gamma X_\gamma^T z_k, V_\gamma)$, where

\[ V_\gamma = \frac{c}{1 + c} (X_\gamma^T X_\gamma)^{-1} \tag{10} \]

• Draw $z_k = [z_{k,1}, \ldots, z_{k,m}]^T$, k = 1, …, K, from a truncated normal distribution with the optimal exponential accept–reject algorithm of Robert (1995) as follows:

for g = 1, 2, …, m

if $y_g = k$, then draw $z_{k,g}$ according to $z_{k,g} \sim N(X_\gamma \beta_k, 1)$ truncated at the left by $\max_{j \neq k} z_{j,g}$, i.e.,

\[ z_{k,g} \sim N(X_\gamma \beta_k, 1)\, 1_{\{z_{k,g} > \max_{j \neq k} z_{j,g}\}} \tag{11} \]

else, if $y_g = j$ and $j \neq k$, then draw $z_{j,g}$ according to $z_{j,g} \sim N(X_\gamma \beta_j, 1)$ truncated at the right by the newly generated $z_{k,g}$, i.e.,

\[ z_{j,g} \sim N(X_\gamma \beta_j, 1)\, 1_{\{z_{j,g} \le z_{k,g}\}} \tag{12} \]

end for

• Here we set $z_{K,g} \sim N(0, 1)$ when $y_g = K$, i.e., we introduce a new equation $z_{K,g} = X_\gamma \beta_K + e_{K,g}$, g = 1, …, m, with $\beta_K$ being a zero vector and $e_{K,g} \sim N(0, 1)$.

In this study, 10,000 Gibbs iterations were implemented with the first 2000 as the burn-in period. We then obtained the Monte Carlo samples $\{\gamma^{(t)}, \beta_k^{(t)}, z_k^{(t)}, t = 2001, \ldots, T\}$, where T = 10,000. Finally, we counted the number of times that each motif appeared in $\{\gamma^{(t)}, t = 2001, 2002, \ldots, T\}$. The motifs with the highest appearance frequencies played the strongest roles in predicting the expression level of the target gene. Fast implementation issues of this algorithm are discussed in Zhou et al. (2004).
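For the binary case, the latent-variable and coefficient draws can be sketched with the standard Albert and Chib (1993) data augmentation, in which $z_g$ is truncated to be positive when $y_g = 1$ and nonpositive when $y_g = 0$. This is a simplified stand-in for the multinomial procedure summarized above, not the authors' code: it replaces the accept–reject sampler of Robert (1995) with inverse-CDF sampling, which is adequate only for moderate values of the linear predictor, and it holds the selected columns `X_gamma` fixed. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()   # standard normal, used for cdf and inverse cdf

def draw_latent(mu, y, rng):
    """Draw z_g ~ N(mu_g, 1) truncated to z_g > 0 if y_g = 1, else z_g <= 0."""
    z = np.empty_like(mu, dtype=float)
    for g, (m_g, y_g) in enumerate(zip(mu, y)):
        p0 = _STD.cdf(-m_g)                           # P(z_g <= 0) before truncation
        lo, hi = (p0, 1.0) if y_g == 1 else (0.0, p0)
        # Clip so inv_cdf never sees exactly 0 or 1 (it would raise an error).
        u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
        z[g] = m_g + _STD.inv_cdf(u)
    return z

def draw_beta(X_gamma, z, c, rng):
    """Draw beta ~ N(V X' z, V) with V = c / (1 + c) * (X'X)^-1, as in Eq. 10."""
    V = c / (1.0 + c) * np.linalg.inv(X_gamma.T @ X_gamma)
    V = (V + V.T) / 2.0                               # remove round-off asymmetry
    return rng.multivariate_normal(V @ X_gamma.T @ z, V)

def probit_sweep(X_gamma, y, beta, c, rng):
    """One Gibbs sweep over z and beta for a fixed set of selected motifs."""
    z = draw_latent(X_gamma @ beta, y, rng)
    return draw_beta(X_gamma, z, c, rng), z
```

In a full sampler, each sweep would also update γ from Eq. 9 before the columns of `X_gamma` are chosen, and the post-burn-in values of γ would be tallied per motif.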

Experimental Results

Expression Data Preprocessing and Preselection

The neuroinflammatory microarray data set of SCI (Pan et al., 2004) has been described in the subsection Gene Expression Data. For SCI data, preprocessing of gene expression data and preselection of genes with maximum fold changes were carried out as outlined in the subsection Expression Data Preprocessing and Preselection. Other researchers have used similar methods to preselect genes during the preprocessing step (Conlon et al., 2003). No normalization procedure was applied to the ratios of the two-channel fluorescence intensity data. After estimating the missing values, we preselected genes whose expression profiles changed significantly between the NS398 treatment and the vehicle (injury without treatment), using their associated p-values as the selection criterion (p = 0). These 2574 genes, from a total of 4967 probes, were then screened for three lists of top 30 genes with maximum fold changes in terms of concordant upregulated expression (Concordant+), concordant downregulated expression (Concordant–), and maximum discordant expression (Discordant). The Concordant+, Concordant–, and Discordant genes, as well as their fold changes of expression levels between NS398 and vehicle, are given in Table 1. The list of Concordant+ genes included most of the genes considered neuroprotective by Pan et al. (2004). Among this list of genes, Ccl2 (NCBI Entrez Gene 24770) and Cxcl1 (NCBI Entrez Gene 81503) stood out, as their expression levels increased by more than ninefold, whereas the rest increased by twofold to fivefold. Chemokine (C–C motif) ligand 2 (Ccl2) is a monocyte chemoattractant protein; chemokine (C–X–C motif) ligand 1 (Cxcl1) acts as a neutrophil chemoattractant. Among Concordant– genes, Dlgap2 (NCBI Entrez Gene 116681) and Dhh (NCBI Entrez Gene 84380) were the two most downregulated ones; their expression levels were reduced by more than 10-fold. Disks large (Drosophila) homolog-associated protein 2 (Dlgap2) is a transmembrane guanylate kinase protein. Desert hedgehog protein (Dhh) may play a role in neuronal differentiation. Ccl2, Cxcl1, Dlgap2, and Dhh were all included in the Discordant gene list by definition.

Motif Selections

To search for TFBSs, three sequence databases, made of 800-bp upstream regions from the sense strand (+ strand), were first retrieved from Ensembl for the lists of Concordant+, Concordant–, and Discordant genes for the SCI data, as outlined in the subsection Sequence Preparation. van Helden et al. (1998) have reported that the majority of regulatory binding sites in eukaryotes are located within 800 bp upstream of the transcription start sites. Similar choices of range have been made by others who are also searching for regulatory sites in the proximal promoter region of genes in eukaryotes through both computational approaches and biological experiments, as described in the subsection Sequence Preparation. MATCH and PATCH from TRANSFAC were then used to search for potential TFBSs in the sequence database, as described in the subsection Transcription Factor Binding Sites Search. Cutoffs of similarity scores were chosen to avoid excessive false-positives among potential binding sites. Table 2 gives the dimensions m × n of the motif score matrix X for the Concordant+, Concordant–, and Discordant gene data sets, as specified in Eq. 4 and used in our Bayesian variable selection. For the 30 Concordant+ genes, there were 189 possible motifs detected in their upstream sequence database; 46 of these 189 motifs were shared by at least six genes. For the 30 Concordant– genes, the total number of possible motifs was 205, and the number of common ones was 46. For the 30 Discordant genes, the numbers were 210 and 45, respectively. These motifs were the features to be selected by our Bayesian approach to model the relationship between gene expression levels and TFBSs. Motif selection results with the top 10 average appearance frequencies (i.e., the ones that play the strongest roles in predicting the expression level) are given in Tables 3–5 for Concordant+, Concordant–, and Discordant genes in SCI data. Within each table there were four conditions of motif scoring and sharing used in Eq. 4. We

Volume 4, 2006_________________________________________________________________ Neuroinformatics

104 __________________________________________________________________________________Liu et al.

Table 1 The Lists of Concordant+ Genes, Concordant– Genes, and Discordant Genes, and Their Fold Changes of Expression Levels Between NS398 and Vehicle in SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses.

Concordant+ Gene | Fold Change | Concordant– Gene | Fold Change | Discordant Gene | Fold Change
Ccl2 (24770) | 9.547893 | Dlgap2 (116681) | 0.090745732 | Ccl2 (24770) | 9.547893
Cxcl1 (81503) | 9.052632 | Dhh (84380) | 0.091025645 | Cxcl1 (81503) | 9.052632
Ptgs2 (29527) | 5.04491 | Zic1 (64618) | 0.107205622 | Ptgs2 (29527) | 5.04491
Pmp22 (24660) | 5.013294 | Neurog3 (60329) | 0.108045979 | Pmp22 (24660) | 5.013294
Ccl2 (24770) | 4.816199 | Wnt4 (84426) | 0.110294114 | Vim (81818) | 4.570118
Vim (81818) | 4.570118 | Ptpru (116680) | 0.110552758 | Dcn (29139) | 4.289552
Dcn (29139) | 4.289552 | E2f1 (399489) | 0.110799445 | Mgp (25333) | 3.853659
Mgp (25333) | 3.853659 | Chn1 (84030) | 0.113153366 | Barhl2 (117232) | 3.663934
Barhl2 (117232) | 3.663934 | Foxd4 (252886) | 0.114854513 | Cxcl2 (114105) | 3.581609
Cxcl2 (114105) | 3.581609 | Dbh (25699) | 0.117730495 | S100a6 (85247) | 3.402878
S100a6 (85247) | 3.402878 | Gp2 (171459) | 0.12381952 | Cubn (80848) | 3.154762
Cubn (80848) | 3.154762 | RGD1307055 (362472) | 0.123989224 | Hmox1 (24451) | 3.132686
Hmox1 (24451) | 3.132686 | Nrxn1 (60391) | 0.124620065 | Mpz (24564) | 3.109966
Mpz (24564) | 3.109966 | Cpb1 (24271) | 0.12643678 | Csh1 (53950) | 3.056782
Csh1 (53950) | 3.056782 | Nvjp2 (50872) | 0.128329295 | Phlda1 (29380) | 3.005376
Phlda1 (29380) | 3.005376 | Iapp_Iars (24476) | 0.134020627 | Nvjp2 (50872) | 0.128329295
Ptgds (25526) | 2.956879 | Add2 (24171) | 0.13603819 | Cpb1 (24271) | 0.12643678
Igfbp2 (25662) | 2.774775 | Cacna1a (25398) | 0.137583887 | Nrxn1 (60391) | 0.124620065
Gp38 (54320) | 2.635015 | Slit2 (360272) | 0.13765182 | RGD1307055 (362472) | 0.123989224
Hba-a1 (25632) | 2.478632 | Adam6 (192271) | 0.138138146 | Gp2 (171459) | 0.12381952
Map1b (29456) | 2.398176 | Gstp2 (29438) | 0.138248854 | Dbh (25699) | 0.117730495
(Continued)


Bayesian Variable Selection for Gene Expression Modeling _________________________________________105

Table 1 (Continued)

Concordant+ Gene | Fold Change | Concordant– Gene | Fold Change | Discordant Gene | Fold Change
Glud1 (24399) | 2.306889 | Capon (192363) | 0.142857143 | Foxd4 (252886) | 0.114854513
Slc38a1 (170567) | 2.261708 | Plau (25619) | 0.143564361 | Chn1 (84030) | 0.113153366
Ugdh (83472) | 2.24058 | Ntn1 (114523) | 0.14473684 | E2f1 (399489) | 0.110799445
Plat (25692) | 2.228183 | Ntf3 (81737) | 0.145429373 | Ptpru (116680) | 0.110552758
nearJunb (gb.X95094) | 2.164835 | Pmf31 (171453) | 0.146387829 | Wnt4 (84426) | 0.110294114
Prx (78960) | 2.148148 | Tas2r41 (246219) | 0.146525686 | Neurog3 (60329) | 0.108045979
Atp1a2 (24212) | 2.138258 | Crygb (301468) | 0.147157181 | Zic1 (64618) | 0.107205622
Hspa1a (24472) | 2.128721 | Chrm3 (24260) | 0.147651013 | Dhh (84380) | 0.091025645
Txnrd1 (58819) | 2.079365 | Apaf1 (78963) | 0.14787879 | Dlgap2 (116681) | 0.090745732

Table 2 The Number of Genes, m, as Well as the Numbers of Total, and Commonly Shared, Distinct TFs, n, in SCI Concordant+, Concordant–, and Discordant Data. NCBI Entrez Gene (LocusLink) ID is in the Parentheses

Data Set | Number of Genes (m) (Samples) | Number of TFs (n) (Features): Total | Number of TFs (n) (Features): Common
Concordant+ | 30 | 189 | 46
Concordant– | 30 | 205 | 46
Discordant | 30 | 210 | 45

considered both score definitions as given in the subsection Motifs Score, while taking into account whether to use all possible motifs in the particular sequence database or just the common motifs shared by at least 6 out of 30 genes, as defined in the subsection “Consideration of Common Motifs.” In particular, the numbers of motifs n are given in Table 2, as described

earlier. Hence, four combinations of motif scores and preselections were analyzed: Condition 1, motif score from Eq. 1 with all possible motifs; Condition 2, motif score from Eq. 3 with all possible motifs; Condition 3, motif score from Eq. 1 with common motifs; Condition 4, motif score from Eq. 3 with common motifs. Among motifs selected more than 90% of the time, there were the NF-κB (NCBI Entrez Gene 81736, 309452, ...) and SMAD (NCBI Entrez Gene 25671, 29357, 25631, 50554, ...) family-wise motifs for Concordant+ genes, as well as PAX4 (NCBI Entrez Gene 83630) and TFAP2C (NCBI Entrez Gene 362280) for Concordant– genes, in Tables 3 and 4. NF-κB, nuclear factor of kappa light polypeptide gene enhancer, is a transcription regulator that is activated by various intra- and extracellular stimuli such as cytokines, oxidant-free radicals, ultraviolet irradiation, and bacterial or viral products. SMAD is a family of


Table 3 The Top 10 Motif Selection Results and Their Bayesian Variable Selection (BVS) Scores of Conditions 1–4 for Concordant+ Genes in SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses. The BVS Score of a Particular Feature Was the Frequency With Which It Appeared in the Posterior Samples, Expressed as a Proportion

Condition 1
TF | BVS Score
MSX1 (81710) | 0.839
STAT4 (367264) | 0.795
TFAP2 (306862,301285,362280,301284) | 0.673
RUNX1 (50662) | 0.343
GEN_INI (general initiator sequence) | 0.34
AP-1 (24516,24371) | 0.338
NF-κB2 (309452) | 0.327
NF-κB (p65) (309452) | 0.285
NF-κB2 precursor (309452) | 0.262
RUNX1a (50662) | 0.252

Condition 2
TF | BVS Score
SMAD (25671,29357,25631,50554) | 0.95
ETS (24356) | 0.386
NF-Y (29508,25336,25337) | 0.357
NF-κB (p65) (309452) | 0.347
Tel-2 (51513) | 0.31
NF-κB2 (309452) | 0.31
ELF-1 (85424) | 0.271
NF-κB2 precursor (309452) | 0.264
HMG IY (117062) | 0.196
TF68 (24820) | 0.138

Condition 3
TF | BVS Score
GEN_INI (general initiator sequence) | 0.997
NF-κB (81736,309452) | 0.964
ZF5 (282825) | 0.761
GATA-4 (54254) | 0.633
USF (83586,81817) | 0.54
YY1 (24919) | 0.512
TFAP2 (306862,301285,362280,301284) | 0.494
STAT4 (367264) | 0.411
Sp1 (24790) | 0.387
ETV4 (360635) | 0.383

Condition 4
TF | BVS Score
GEN_INI (general initiator sequence) | 0.735
STAT4 (367264) | 0.666
ZF5 (282825) | 0.602
NF-κB (81736,309452) | 0.53
YY1 (24919) | 0.292
Xvent-1 (394455) | 0.252
MSX1 (81710) | 0.157
CDX1 (171042,364883) | 0.149
RUNX1 (50662) | 0.131
Zic2 (361096) | 0.126


Table 4 The Top 10 Motif Selection Results and Their Bayesian Variable Selection (BVS) Scores of Conditions 1–4 for Concordant– Genes in SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses. The BVS Score of a Particular Feature Was the Frequency With Which It Appeared in the Posterior Samples, Expressed as a Proportion

Condition 1
TF | BVS Score
SRF (301242) | 0.526
c-Fos (24371) | 0.524
XFD-2 (397954) | 0.394
Poly A | 0.353
Rb/E2F1/DP-1 (24708,399489,7027) | 0.344
Fra-1 (25445) | 0.316
Myogenin/NF-1 (25492) | 0.3
E2F1 (399489) | 0.24
CDX1 (171042,364883) | 0.238
CREB (81646) | 0.22

Condition 2
TF | BVS Score
PAX4 (83630) | 1
E2 (1496959, 1489020) | 1
HiNF-C (25988) | 0.975
TFAP2C (362280) | 0.961
STAT5A (homotetramer) (24918) | 0.371
CDP CR1 (116639) | 0.34
POU1F1a (25517) | 0.278
LEF-1 (161452) | 0.256
Zic2 (361096) | 0.22
E47 (171046) | 0.22

Condition 3
TF | BVS Score
ZF5 (282825) | 0.864
SRF (301242) | 0.637
FOXM1 (58921) | 0.59
E2F1 (399489) | 0.552
CDX1 (171042,364883) | 0.522
C/EBP (24252,24253,25301) | 0.347
p53 Decamer (24842) | 0.328
USF (83586,81817) | 0.324
NF-κB (81736,309452) | 0.27
PAX4 (83630) | 0.269

Condition 4
TF | BVS Score
SMAD (25671,29357,25631,50554) | 0.794
CDX1 (171042,364883) | 0.674
TEF-1 (361630) | 0.466
ZF5 (282825) | 0.456
HNF-4 (25735) | 0.389
E2F (399489) | 0.334
PAX4 (83630) | 0.312
PAX2 (293992) | 0.298
FOXM1 (58921) | 0.279
Xvent-1 (394455) | 0.212


Table 5 The Top 10 Motif Selection Results and Their Bayesian Variable Selection (BVS) Scores of Conditions 1–4 for Discordant Genes in SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses. The BVS Score of a Particular Feature Was the Frequency With Which It Appeared in the Posterior Samples, Expressed as a Proportion

Condition 1
TF | BVS Score
AFP1 (24177) | 0.384
TTF1 (25628) | 0.252
Hivep1 (117140) | 0.219
IRF-7 (293624) | 0.21
GCM (29394,291047) | 0.176
Zic2 (361096) | 0.164
Runx2 (367218) | 0.164
Pou1f1 (25517) | 0.148
T3R-β2 (24831) | 0.143
NF-κB1 (81736) | 0.112

Condition 2
TF | BVS Score
Runx2 (367218) | 0.458
AFP1 (24177) | 0.444
PAX3 (114502) | 0.432
SRF (301242) | 0.245
Octamer (300703,171068,117058,116544,364733) | 0.22
AREB6 (25705) | 0.219
Zic2 (361096) | 0.211
HOXA3 (24455,3200,15400) | 0.196
FAC1 (303617) | 0.193
p53 decamer (24842) | 0.186

Condition 3
TF | BVS Score
E2F1 (399489) | 0.347
NF-κB (81736,309452) | 0.34
RUNX1 (50662) | 0.337
AP-1 (24516,24371) | 0.252
CREB (81646) | 0.228
KLF12 (306110) | 0.207
MSX1 (81710) | 0.2
RUNX1a (50662) | 0.192
MAZ (293501) | 0.121
Myc (24577) | 0.105

Condition 4
TF | BVS Score
Myc (24577) | 0.737
LEF1 (161452) | 0.734
NF-κB (81736,309452) | 0.597
TFAP2 (306862,301285,362280,301284) | 0.473
COUP (81808,25735) | 0.46
RAR (24705,25271) | 0.428
SMAD (25671,29357,25631,50554) | 0.417
Zic3 (7547,22773) | 0.405
MSX1 (81710) | 0.321
STAT4 (367264) | 0.266

Table 6 The Top 10 Motif Selection Results and Their Bayesian Variable Selection (BVS) Scores of Conditions 3 and 4 Without Ccl2 and Cxcl1, the Top Two Genes in Concordant+ List of SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses. The BVS Score of a Particular Feature Was the Frequency With Which It Appeared in the Posterior Samples, Expressed as a Proportion

Condition 3– TF | BVS Score | Condition 4– TF | BVS Score
TFAP2 (306862,301285,362280,301284) | 0.92 | SMAD (25671,29357,25631,50554) | 0.749
STAT4 (367264) | 0.87 | NF-κB (81736,309452) | 0.436
GEN_INI (general initiator sequence) | 0.839 | STAT4 (367264) | 0.377
MSX1 (81710) | 0.571 | USF (83586,81817) | 0.31
NF-κB (81736,309452) | 0.408 | PU.1 (366126) | 0.266
ZF5 (282825) | 0.329 | PAX4 (83630) | 0.238
p53 decamer (24842) | 0.283 | p53 Decamer (24842) | 0.237
Zic1 (64618) | 0.208 | VDR (24873) | 0.181
PAX3 (114502) | 0.182 | GEN_INI (general initiator sequence) | 0.164
GC box (24790,25161,25162) | 0.182 | ZF5 (282825) | 0.152

proteins similar to the gene products of the Drosophila gene “mothers against decapentaplegic” (Mad). SMAD proteins are signal transducers and transcriptional modulators that mediate multiple signaling pathways. PAX4 is a member of the PAX family, the human homologs of the Drosophila melanogaster prd gene. TFs encoded by the PAX family of genes play critical roles during fetal development and cancer growth. TFAP2C, or TFAP-2γ, belongs to the family of AP-2 sequence-specific DNA-binding TFs (TFAP2s) that are involved in the activation of several developmental genes. Because the expression levels of Ccl2 and Cxcl1, the top two genes in the Concordant+ list of SCI data, increased more than ninefold, which was drastically different from the rest

of Concordant+ genes, we took these two out and reran our Bayesian variable selection. Motif selection results of Conditions 3 and 4 without Ccl2 and Cxcl1 are given in Table 6. There were two conditions within the table: Condition 3–, motif score from Eq. 1 with common motifs, without Ccl2 and Cxcl1; Condition 4–, motif score from Eq. 3 with common motifs, without Ccl2 and Cxcl1. The family-wise motif TFAP2 (NCBI Entrez Gene 306862, 301285, 362280, 301284, ...), TF AP-2, was selected more than 90% of the time, as shown in Table 6. A few regulatory motifs ranked consistently within the top 10 spots under various conditions, and their results are summarized in Table 7. These motifs included


Table 7 Summary of Motif Selection Results Across Conditions (1–4, 3–, 4–) and Lists of Genes (Concordant+, Concordant–, Discordant) for SCI Data. NCBI Entrez Gene (LocusLink) IDs Are in Parentheses. The BVS Score of a Particular Feature Was the Frequency With Which It Appeared in the Posterior Samples, Expressed as a Proportion

TF | Concordant+ 1 | 2 | 3 | 4 | 3– | 4– | Concordant– 1 | 2 | 3 | 4 | Discordant 1 | 2 | 3 | 4
NF-κB |  |  | x | x | x | x |  |  | x |  |  |  | x | x
NF-κB1 (81736) |  |  |  |  |  |  |  |  |  |  | x |  |  | 
NF-κB2 (309452) | x | x |  |  |  |  |  |  |  |  |  |  |  | 
STAT4 (367264) | x |  | x | x | x | x |  |  |  |  |  |  |  | x
PAX2 (293992) |  |  |  |  |  |  |  |  |  | x |  |  |  | 
PAX3 (114502) |  |  |  |  | x |  |  |  |  |  |  | x |  | 
PAX4 (83630) |  |  |  |  |  | x |  | x | x | x |  |  |  | 
E2F1 (399489) |  |  |  |  |  |  | x |  | x | x |  |  | x | 
CDX1 (171042,364883) |  |  |  |  |  |  | x |  | x | x |  |  |  | 

NF-κB family-wise motif (NCBI Entrez Gene 81736, 309452, ...), NF-κB1 (NCBI Entrez Gene 81736), NF-κB2 (NCBI Entrez Gene 309452), STAT4 (NCBI Entrez Gene 367264), PAX2 (NCBI Entrez Gene 293992), PAX3 (NCBI Entrez Gene 114502), PAX4 (NCBI Entrez Gene 83630), E2F1 (NCBI Entrez Gene 399489), and CDX1 (NCBI Entrez Gene 171042, 364883). NF-κB1 and NF-κB2 belong to the same NF-κB gene enhancer family. STAT4, signal transducer and activator of transcription 4, is a member of the STAT family of TFs that responds to cytokines and growth factors. Like PAX4, PAX2 and PAX3 are also members of the PAX family, whose protein products play critical roles during fetal development. E2F1 is a member of the E2F family of TFs that plays a crucial role in the control of the cell cycle and the action of tumor suppressor proteins. E2F is also a target of the transforming proteins of small DNA tumor

viruses. Caudal-type homeobox TF1 (CDX1) is a member of the CDX family of TFs that are often associated with cancer development in the gastrointestinal tract. It should be noted that some of the regulatory binding site motifs are defined family-wise whereas other motifs are member specific. For example, here is a brief list of family-wise motifs, each followed by its member-specific motifs: NF-κB: NF-κB1 (NCBI Entrez Gene 81736), NF-κB2 (NCBI Entrez Gene 309452), and so on; PAX: PAX2 (NCBI Entrez Gene 293992), PAX3 (NCBI Entrez Gene 114502), PAX4 (NCBI Entrez Gene 83630), and so on; SMAD: SMAD1 (NCBI Entrez Gene 25671), SMAD2 (NCBI Entrez Gene 29357), SMAD3 (NCBI Entrez Gene 25631), SMAD4 (NCBI Entrez Gene 50554), and so on; TFAP2: TFAP2A (NCBI Entrez Gene 306862), TFAP2B (NCBI Entrez Gene 301285), TFAP2C (NCBI Entrez Gene 362280), TFAP2D


(NCBI Entrez Gene 301284), and so on. Note that we do not attempt to list all possible members within each family.
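The common-motif preselection behind Conditions 3 and 4 (keeping only motifs shared by at least 6 of the 30 genes) can be illustrated with a minimal Python sketch. It assumes, hypothetically, that a motif counts as present in a gene whenever its entry in the motif score matrix is greater than zero; the function names and the toy matrix are illustrative and are not taken from the original analysis.

```python
def common_motifs(scores, min_genes=6):
    """Return the column indices of motifs present in at least `min_genes` genes.

    `scores` is a list of rows (one row per gene, one column per motif).
    Assumption: a motif is "present" in a gene when its score is > 0.
    """
    n_motifs = len(scores[0])
    keep = []
    for j in range(n_motifs):
        n_present = sum(1 for row in scores if row[j] > 0)
        if n_present >= min_genes:
            keep.append(j)
    return keep


def restrict_to(scores, keep):
    """Keep only the selected motif columns of the score matrix."""
    return [[row[j] for j in keep] for row in scores]


# Toy score matrix: 8 genes (rows) x 3 motifs (columns), made-up values.
toy_scores = [
    [1.0, 0.0, 0.5],
    [0.8, 0.0, 0.4],
    [0.9, 0.7, 0.0],
    [1.2, 0.0, 0.6],
    [0.0, 0.0, 0.3],
    [0.7, 0.2, 0.9],
    [1.1, 0.0, 0.0],
    [0.6, 0.0, 0.8],
]

kept = common_motifs(toy_scores, min_genes=6)   # motifs 0 and 2 survive
reduced = restrict_to(toy_scores, kept)
```

In this toy example the second motif occurs in only two genes and is dropped, which mirrors how the number of features n falls from the "Total" to the "Common" column of Table 2.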

Discussion We applied the Bayesian variable selection with two regression models to study the relationship between TFBSs and gene expression levels on microarray data of SCI. This neuroinflammatory microarray data set has been described in the subsection Gene Expression Data. Our three-step approach, as described in the rest of the Methods section, includes (1) the preprocessing and preselection of gene expression data, (2) the search for TFBSs in the genomic sequences of preselected genes, and (3) the parameter estimation using the Bayesian variable selection. The results for Concordant+, Concordant–, and Discordant genes, as well as motif selections and their summary, are given in Tables 1 and 3–7. Next, we examine these results further and discuss their biological relevance.

Gene Preselection Table 1 shows the Concordant+, Concordant–, and Discordant gene preselection results of SCI data. Ccl2 and Cxcl1 stood out among Concordant+ genes, with their expression levels increasing more than ninefold owing to the treatment with NS398. In comparison, the rest increased only twofold to fivefold. Ccl2 (also known as MCP-1, Scya2, or Sigje) encodes a protein that binds to chemokine receptors CCR2 and CCR4. It belongs to a family of chemotactic proinflammatory activation-inducible cytokines that act primarily on hemopoietic cells in immunoregulatory processes. It has been implicated in the pathogenesis of diseases characterized by monocytic infiltrations. Cxcl1 (also known as Gro1 or CINC-1) encodes a protein that responds to infection or wounding. It is known to play a role in the acute phase inflammatory response. On the other hand, all downregulated genes in the Concordant– data set had their expression

levels repressed at least sixfold, showing the effect of NS398 as a highly selective COX-2 inhibitor. Among them, both Dlgap2 and Dhh saw their expression levels decrease more than 10-fold. Dlgap2 (also known as PSD-95) encodes a protein with guanylate kinase activity, catalyzing the reaction ATP + GMP = ADP + GDP. It is a transmembrane guanylate kinase protein localized to postsynaptic densities in neuronal cells and may play a role in synaptogenesis. Dhh (also known as RGD:620711) is suggested to be involved in the development of the nervous system and the myelination of nerve ensheathment, in addition to the regulation of insulin production.
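The "sixfold" and "10-fold" statements above follow from taking the reciprocal of the fold changes listed in Table 1. The short Python sketch below makes that arithmetic explicit, using the two smallest and the largest Concordant– fold changes from Table 1; the function and variable names are illustrative assumptions.

```python
def fold_repression(fold_change):
    """Convert a fold change below 1 (treated / vehicle) into an x-fold repression."""
    return 1.0 / fold_change


# Fold changes copied from the Concordant- column of Table 1.
dlgap2 = 0.090745732                 # strongest repression in the list
dhh = 0.091025645                    # second strongest
weakest_concordant_minus = 0.14787879  # largest fold change in the list

for name, fc in [("Dlgap2", dlgap2), ("Dhh", dhh),
                 ("weakest Concordant- gene", weakest_concordant_minus)]:
    print(f"{name}: repressed {fold_repression(fc):.2f}-fold")
```

Both Dlgap2 and Dhh come out at roughly 11-fold, and even the least repressed Concordant– gene is repressed more than sixfold.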

Motif Selection In our Bayesian variable selection approach, we used different regression models for continuous and discrete expression profiles. For regulatory motifs associated with Concordant+ and Concordant– genes, we used the Bayesian variable selection with linear regression. For motifs associated with Discordant genes, we used the Bayesian variable selection with probit regression. The first half of the Discordant data included the 15 most upregulated genes whereas the second half included the 15 most downregulated genes. These genes were thus labeled accordingly in the binomial probit model. Such dichotomization represented two categories of genes whose expression levels showed the most extreme differences; judging from the fold changes in Table 1, the ratios of fold changes between the two categories of Discordant genes ranged from about 23 for the closest pair (Phlda1 and Nvjp2, NCBI Entrez Gene 29380, 50872) to about 105 for the farthest pair (Ccl2 and Dlgap2, NCBI Entrez Gene 24770, 116681). Therefore, we applied a discrete model to capture the information between the modulation of gene expression levels and TFBSs. As with all discretized models, genes with moderate differences in fold change will be categorized into the same class. On the other hand,


the general multinomial probit model that we have presented is applicable to discrete expression profiles with arbitrary levels, from binary format to ternary format, and more. Hence, our Bayesian variable selection with probit regression should still be able to capture the necessary detailed information at ever finer scales. Further examination of Table 7 revealed that the NF-κB family-wise motif, NF-κB1, NF-κB2, and STAT4 were mostly related to the upregulated genes. The exception was NF-κB under one condition of downregulated genes. Among the family of motifs associated with NF-κB, NF-κB and NF-κB1 were also chosen as important motifs in modeling Discordant genes; using only common motifs, NF-κB was selected no matter which motif score definition was used in our modeling. The NF-κB family, including NF-κB1 (also known as KBF1, EBP-1, MGC54151, NF-κB-p50, NF-κB-p105, or NF-κB) and NF-κB2 (also known as LYT10 or LYT-10), has been detected in numerous cell types that express cytokines, chemokines, growth factors, cell adhesion molecules, and some acute phase proteins in health and in various disease states. Activated NF-κB translocates into the nucleus and stimulates the expression of genes involved in a wide variety of biological functions. Inappropriate activation of NF-κB has been linked to inflammatory events associated with autoimmune diseases, whereas complete and persistent inhibition of NF-κB leads to inappropriate immune cell development, delayed cell growth, or apoptosis. Besides NF-κB, NF-κB1, and NF-κB2, STAT4 was also indicated mostly by the upregulated genes. In response to cytokines and growth factors, members of the STAT family are phosphorylated by the receptor-associated kinases and then translocate to the cell nucleus, where they act as transcription activators. In particular, the STAT4 protein is essential for mediating responses to IL-12 in lymphocytes and for regulating the differentiation of T-helper cells.

In addition, the examination of Table 7 also shows that PAX2, PAX3, PAX4, E2F1, and CDX1 were mostly related to the downregulated genes. The exceptions included PAX3 and PAX4 under conditions of only common motifs for upregulated genes without Ccl2 and Cxcl1. PAX3 was also chosen as an important motif in modeling Discordant genes under one condition. PAX2 is one of many human homologs of the D. melanogaster gene prd. PAX2 is believed to be a target of transcriptional suppression by the tumor suppressor gene WT1. Mutations within PAX2 have been shown to result in optic nerve coloboma and renal hypoplasia. Alternative splicing of this gene results in multiple transcript variants. Mutations in PAX3 (also known as WS1, CDHS, or HUP2) are associated with Waardenburg syndrome, craniofacial-deafness-hand syndrome, and alveolar rhabdomyosarcoma. Alternative splicing results in transcripts encoding isoforms with different C-termini. PAX4 is involved in pancreatic islet development and plays a role in the differentiation of insulin-producing β cells. The E2F1 (also known as RBP3, E2F-1, RBBP3, or GD:E2F1) protein binds preferentially to the retinoblastoma protein pRB in a cell cycle-dependent manner. It can mediate both cell proliferation and p53-dependent/independent apoptosis. CDX1 is also known as CDXA. Members of this TF family have also been linked to the vitamin D receptor and the risk of fracture. It is not yet clear why the effect of the selective COX-2 inhibitor NS398 is most pronounced on genes with these regulatory binding sites. The paradoxical roles of NF-κB and PAX4, revealed by their contradictory associations with both the upregulated genes and the downregulated genes in Table 7, demonstrate the complexity of transcription regulation. Transcription regulation of genes in eukaryotes is often controlled through multiple TFs in a well-coordinated fashion. Such coordination may happen across different families of TFs or within the same family but among different members.


This is especially true for genes that contain TATA-less promoter regions (Azizkhan et al., 1993; Rajakumar et al., 2004). For example, the binding activities and regulatory roles of the Sp family (Sp1, Sp3, Sp4, NCBI Entrez Gene 24790, 25161, 25162) have been reported extensively. Several domains outside of the zinc finger region of Sp family proteins have been shown to mediate transcriptional activation through interaction with other TFs (Supp et al., 1996). In addition, an individual member within the same TF family can have the role of a transcription activator, repressor, or even both. These members may share common binding site motifs or even bind to the exact same site to cooperate with or to counter each other’s action. As an example, although Sp1 and Sp3 bind the same elements in promoter regions, they may exert similar effects (Yoo et al., 2002) or opposite effects (Rajakumar et al., 1998) on gene expression. Meanwhile, both Sp1 and Sp3 can be activators or repressors in regulating gene transcription, while competing for the same sites or acting synergistically with each other (Hagen et al., 1994; Birnbaum et al., 1995; Bigger et al., 1997; Majello et al., 1997; Noti, 1997; Geiger et al., 2001; Ross et al., 2002; Rajakumar et al., 2004). Therefore, in order to obtain a more thorough perspective of the interdependence of regulatory binding sites, it is crucial to model the relationship between TFBSs and gene expression levels using both family-wise and member-specific motifs, under various combinations of regression models with the Bayesian variable selection, as well as motif scoring and sharing conditions.

Conclusion We have presented an application of both a linear regression model and a probit regression model to the problem of binding site motif selection in modeling the SCI gene expression data. In particular, we devised a three-step systematic analysis approach. The first step was to preprocess microarray data using missing value estimation and to preselect genes using p-values and expression ratios. The second step

was to identify and then score individual TFBSs within the DNA sequence of each gene. The motif score was based on the degree of similarity and the number of TFBSs. In the last step, linear regression and probit regression were used to build a predictive model of gene expression outcomes using these TFBSs as predictors. The strength of our approach is that it models the relationship between TFBSs and gene expression levels using both family-wise and member-specific motifs, under various combinations of regression models with the Bayesian variable selection, as well as motif scoring and sharing conditions, in order to account for complexity such as the coordination across different families of TFs and among members within the same family of TFs. Further coordination complexity arises from the fact that an individual member within the same family might play the role of a transcription activator, repressor, or both, whether member-specific binding site motifs or family-wise motifs are used. Hence, in this study we considered various combinations of regression models with the Bayesian variable selection, as well as motif scoring and sharing conditions, in order to explore fully the interdependence of TFBSs in their regulatory roles of gene expression. For the SCI gene expression data sets that we have analyzed, the top-ranked motifs, the NF-κB and PAX families of TFs in particular, demonstrated intricate regulatory roles either as a family or as individual members during transcription activation and repression in neuroinflammatory events. Our approach of regression models with the Bayesian variable selection can be applied to create plausible hypotheses for combinatorial regulation by TFBSs that are based on experimentally verified binding site sequence information, while avoiding false-positive candidates in the modeling process. Such a systematic approach provides the possibility to dissect transcription regulation from a more comprehensive perspective, through which phenotypical events at cellular and tissue levels are moved forward by molecular events at gene transcription and translation levels.

Acknowledgments We thank Pan et al. for their SCI data. This work was supported by the Center for Bioinformatics (CBI) Research Grant, Harvard Center for Neurodegeneration and Repair (HCNR), Harvard Medical School to STCW. John Chow at HCNR–CBI has helped us with TRANSFAC installation.

Appendix

Bayesian Variable Selection for Linear Regression

To treat motif selection under the Bayesian framework, we make the following assumptions on the priors of the parameters in Eq. 5. First, given $\gamma$ and $\sigma^2$, the prior for $\beta_\gamma$ is

$$\beta_\gamma \sim N\left(0,\; c\,\sigma^2 \left(X_\gamma^T X_\gamma\right)^{-1}\right),$$

in which $c$ is a constant. We set $c = 100$ empirically (Albert and Chib, 1993; Smith and Kohn, 1997; Lee et al., 2003). Given $\gamma$, the prior for $\sigma^2$ is assumed to be a conjugate inverse-gamma distribution, $p(\sigma^2 \mid \gamma) \propto IG(\nu_0/2, \nu_0/2)$. When $\nu_0 = 0$, we obtain Jeffreys' uninformative prior, i.e., $p(\sigma^2) \propto 1/\sigma^2$. The Bayesian variable selection using a binomial probit regression model is discussed in Lee et al. (2003), in which it is assumed that $\sigma^2 = 1$. Moreover, $\{\gamma_j\}_{j=1}^{n}$ are assumed to be independent with $p(\gamma_j = 1) = \pi_j$, $j = 1, \ldots, n$, in which $\pi_j$ is the probability of selecting motif $j$. Obviously, if we want to select 10 motifs from all $n$ motifs, then $\pi_j$ may be set as $10/n$. In this article, we empirically set $\pi_j = 10/n$ for all motifs, based on the total sample number $m = 30$. If $\pi_j$ was chosen to take a larger value, then we found that $X_\gamma^T X_\gamma$ was often singular, so that $\left(X_\gamma^T X_\gamma\right)^{-1}$ did not exist.

Here we introduce the Bayesian variable selection principle (Smith et al., 1997). A Gibbs sampler is employed to estimate the parameters. Denote

$$S(\gamma, y) = y^T y - \frac{c}{c+1}\, y^T X_\gamma \left(X_\gamma^T X_\gamma\right)^{-1} X_\gamma^T y \tag{A1}$$

in which $y = [y_1, y_2, \ldots, y_m]^T$. Define $n_\gamma = \sum_{h=1}^{n} \gamma_h$. We have shown (Zhou et al., 2003b) that

$$p(y \mid \gamma) \propto \int_{\sigma^2} \left\{ \int_{\beta_\gamma} p(y \mid \beta_\gamma, \sigma^2)\, p(\beta_\gamma \mid \sigma^2)\, d\beta_\gamma \right\} p(\sigma^2)\, d\sigma^2 \propto (1+c)^{-n_\gamma/2}\, S(\gamma, y)^{-m/2} \tag{A2}$$

Then the posterior distribution of $\gamma$ is

$$p(\gamma \mid y) \propto p(y \mid \gamma)\, p(\gamma) \propto (1+c)^{-n_\gamma/2}\, S(\gamma, y)^{-m/2} \prod_{j=1}^{n} \pi_j^{\gamma_j} (1-\pi_j)^{1-\gamma_j} \tag{A3}$$

It can be shown further that the posterior distributions of $\sigma^2$ and $\beta$ are given, respectively, by

$$p(\sigma^2 \mid y, X_\gamma) \propto IG\left(\frac{m}{2},\; \frac{S(\gamma, y)}{2}\right) \tag{A4}$$

$$p(\beta \mid y, X_\gamma, \sigma^2) \propto N\left(V_\gamma X_\gamma^T y,\; \sigma^2 V_\gamma\right) \tag{A5}$$

where

$$V_\gamma \triangleq \frac{c}{1+c} \left(X_\gamma^T X_\gamma\right)^{-1} \tag{A6}$$
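As an illustration of how Eqs. A1–A3 can drive a sampler, the following is a minimal Python sketch of a component-wise Gibbs sampler over the indicator vector γ, with β and σ² integrated out. It is a sketch under stated assumptions rather than the authors' implementation: the function names, the defaults for `n_iter`, `burn_in`, and `seed`, the use of NumPy, and the synthetic data in the usage lines are all hypothetical. The returned inclusion frequency corresponds to the BVS score described in the captions of Tables 3–7 (the proportion of posterior samples in which a feature appears).

```python
import numpy as np


def log_posterior(gamma, X, y, c=100.0, pi=0.1):
    """Unnormalized log p(gamma | y) following Eq. A3."""
    m = y.shape[0]
    n = gamma.size
    Xg = X[:, gamma]                      # columns with gamma_j = 1
    n_gamma = Xg.shape[1]
    S = y @ y                             # S(gamma, y) of Eq. A1
    if n_gamma > 0:
        proj = Xg.T @ y
        S -= c / (c + 1.0) * (proj @ np.linalg.solve(Xg.T @ Xg, proj))
    log_prior = n_gamma * np.log(pi) + (n - n_gamma) * np.log(1.0 - pi)
    return -0.5 * n_gamma * np.log1p(c) - 0.5 * m * np.log(S) + log_prior


def gibbs_bvs(X, y, n_iter=500, burn_in=100, c=100.0, pi=0.1, seed=0):
    """Return the posterior inclusion frequency of each column of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    gamma = np.zeros(n, dtype=bool)
    counts = np.zeros(n)
    for it in range(n_iter):
        for j in range(n):
            # Compare the two states of gamma_j with all other entries fixed.
            gamma[j] = True
            lp1 = log_posterior(gamma, X, y, c, pi)
            gamma[j] = False
            lp0 = log_posterior(gamma, X, y, c, pi)
            p1 = 1.0 / (1.0 + np.exp(np.clip(lp0 - lp1, -700.0, 700.0)))
            gamma[j] = rng.random() < p1
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)


# Hypothetical usage on synthetic data: 30 "genes", 8 "motifs",
# of which only columns 0 and 3 actually drive the response.
data_rng = np.random.default_rng(1)
X_demo = data_rng.normal(size=(30, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + 0.3 * data_rng.normal(size=30)
freq = gibbs_bvs(X_demo, y_demo, n_iter=200, burn_in=50, pi=0.25)
```

Working on the log scale avoids overflow in the powers of S(γ, y), and solving the linear system instead of forming the explicit inverse will raise an error if the selected columns make the Gram matrix singular, which is the situation the text associates with large values of π_j.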

Bayesian Variable Selection for Probit Regression

For the special case of $K = 2$, the relationship between the gene expression level $y_g$ and the motif scores of the combination of TFBSs $X_g$ is modeled as a standard binomial probit regressor (Albert and Chib, 1993). Please see Lee et al. (2003) for a detailed derivation of the posterior distributions. When $K > 2$, the situation is different from the binomial case because we have to construct $K - 1$ regression equations similar to Eq. 8. Introduce $K - 1$ latent variables $z_1, \ldots, z_{K-1}$ and $K - 1$ regression equations such that $z_k = X\beta_k + e_k$, $k = 1, \ldots, K-1$, in which $e_k \sim N(0, 1)$. Let $z_k$ take $m$ values $\{z_{k,1}, \ldots, z_{k,m}\}$. Denote $z_k = [z_{k,1}, \ldots, z_{k,m}]^T$ and $e_k = [e_{k,1}, \ldots, e_{k,m}]^T$. Then the model can be rewritten in vector form:

$$z_k = X\beta_k + e_k, \quad k = 1, \ldots, K-1 \tag{A7}$$

This model is called the multinomial probit model. For background on multinomial probit models, see Imai and van Dyk (2003). Note that we do not have the observations of $\{z_k\}_{k=1}^{K-1}$, which makes it difficult to estimate the parameters in Eq. A7. Here we discuss how to select the same strongest motifs for the different regression equations. The model is a little different from Eq. A7, i.e., the selected motifs do not change with the different regression equations. Note that the parameter $\beta$ is still dependent on $k$ and $\gamma$, denoted by $\beta_{k,\gamma}$. Then Eq. A7 is rewritten as

$$z_k = X_\gamma \beta_{k,\gamma} + e_k, \quad k = 1, \ldots, K-1 \tag{A8}$$

in which $X_\gamma$ means the columns of $X$ corresponding to those elements of $\gamma$ that are equal to 1, and similar notation applies to $\beta_{k,\gamma}$. Now the problem is how to estimate $\gamma$ and the corresponding $\beta_{k,\gamma}$ and $z_k$ for each equation in Eq. A8. A Gibbs sampler is employed to estimate all the parameters. Given $\gamma$ for equation $k$, the prior distribution of $\beta_{k,\gamma}$ is $\beta_{k,\gamma} \sim N\left[0,\; c\left(X_\gamma^T X_\gamma\right)^{-1}\right]$ (Lee et al., 2003), in which $c$ is a constant. We set $c = 10$ in this study. The detailed derivation of the posterior distributions of the parameters is given in Lee et al. (2003). Denote

$$S(\gamma, z_k) = z_k^T z_k - \frac{c}{c+1}\, z_k^T X_\gamma \left(X_\gamma^T X_\gamma\right)^{-1} X_\gamma^T z_k, \quad k = 1, \ldots, K-1 \tag{A9}$$

By straightforward computing, the posterior distribution $p(\gamma \mid z_1, \ldots, z_{K-1})$ is approximated by

$$p(\gamma \mid z_1, \ldots, z_{K-1}) \propto p(z_1, \ldots, z_{K-1} \mid \gamma)\, p(\gamma) \propto (1+c)^{-(K-1) n_\gamma / 2} \exp\left\{-\frac{1}{2} \sum_{k=1}^{K-1} S(\gamma, z_k)\right\} \times \prod_{h=1}^{n} \pi_h^{\gamma_h} (1-\pi_h)^{1-\gamma_h}, \tag{A10}$$

and the posterior distribution $p(\beta_{k,\gamma} \mid z_k)$ is given by

$$\beta_{k,\gamma} \mid z_k, X_\gamma \sim N\left(V_\gamma X_\gamma^T z_k,\; V_\gamma\right),$$

in which

$$V_\gamma = \frac{c}{1+c} \left(X_\gamma^T X_\gamma\right)^{-1}$$
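Because the latent variables are never observed, a Gibbs sampler for the probit model has to draw them as well. For the binomial case (K = 2), the data augmentation scheme of Albert and Chib (1993) draws each latent value from a unit-variance normal distribution centered at the linear predictor and truncated according to the class label. The Python sketch below shows only that one step, using inverse-CDF sampling from the standard library; the function names, the 0/1 label coding, and the toy numbers are assumptions made for illustration.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def sample_latent(mean, label, rng):
    """Draw z ~ N(mean, 1) truncated to z > 0 if label == 1, else to z <= 0.

    Uses inverse-CDF sampling. The clamp below keeps the argument of
    inv_cdf strictly inside (0, 1); for extreme means (roughly |mean| > 7)
    this simple scheme loses accuracy and a dedicated sampler is preferable.
    """
    p_nonpositive = _STD_NORMAL.cdf(-mean)        # P(z <= 0)
    if label == 1:
        u = rng.uniform(p_nonpositive, 1.0)
    else:
        u = rng.uniform(0.0, p_nonpositive)
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + _STD_NORMAL.inv_cdf(u)


def sample_all_latents(X_rows, beta, labels, rng):
    """One latent draw per gene, with mean equal to the gene's linear predictor."""
    draws = []
    for row, label in zip(X_rows, labels):
        mean = sum(x * b for x, b in zip(row, beta))
        draws.append(sample_latent(mean, label, rng))
    return draws


# Hypothetical usage: 4 genes, 2 selected motifs, labels 1 = up, 0 = down.
rng = random.Random(0)
X_rows = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.6]]
beta = [1.5, -1.0]
labels = [1, 0, 1, 0]
z = sample_all_latents(X_rows, beta, labels, rng)
```

Given the sampled latent values, the remaining conditionals for γ and β have the same structure as in the linear case, with the latent vector playing the role of y.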

References Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 88, 669–679. Azizkhan, J. C., Jensen, D. E., Pierce, A. J., and Wade, M. (1993) Transcription from TATA-less promoters: dihydrofolate reductase as a model. Crit. Rev. Eukaryot. Gene Expr. 3, 229–254. Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36. Bailey, T. L. and Noble, W. S. (2003) Searching for statistically significant regulatory modules. Bioinformatics. 19(Suppl 2), II16–II25. Bigger, C. B., Melnikova, I. N., and Gardner, P. D. (1997) Sp1 and Sp3 regulate expression of the neuronal nicotinic acetylcholine receptor beta4 subunit gene. J. Biol. Chem. 272, 25,976–25,982. Birnbaum, M. J., van Wijnen, A. J., Odgren, P. R., et al. (1995) Sp1 trans-activation of cell cycle regulated promoters is selectively repressed by Sp3. Biochemistry 34, 16,503–16,508. Bussemaker, H. J., Li, H., and Siggia, E. D. (2001) Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171. Chen, Q. K., Hertz, G. Z., and Stormo, G. D. (1995) MATRIX SEARCH 1.0: Acomputer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566. Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA. 100, 3339–3344. Ensemble (2004) Project Ensemble (http://www. ensembl.org/).

Geiger, A., Salazar, G., and Kervran, A. (2001) Role of the Sp family of transcription factors on glucagon receptor gene expression. Biochem. Biophys. Res. Commun. 285, 838–844.
Grabe, N. (2002) AliBaba2: Context specific identification of transcription factor binding sites. In Silico Biol. 2, S1–S15.
Grundy, W. N., Bailey, T. L., Elkan, C. P., and Baker, M. E. (1997) Meta-MEME: motif-based hidden Markov models of protein families. Comput. Appl. Biosci. 13, 397–406.
Hagen, G., Muller, S., Beato, M., and Suske, G. (1994) Sp1-mediated transcriptional activation is repressed by Sp3. EMBO J. 13, 3843–3851.
Imai, K. and van Dyk, D. A. (2003) A Bayesian analysis of the multinomial probit model using marginal data augmentation. J. Econometrics 124, 311–334.
Kechris, K. J., van Zwet, E., Bickel, P. J., and Eisen, M. B. (2004) Detecting DNA regulatory motifs by incorporating positional trends in information content. Genome Biol. 5, R50.
Kel, A. E., Gossling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O. V., and Wingender, E. (2003) MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucl. Acids Res. 31, 3576–3579.
Kel, A., Kel-Margoulis, O., Babenko, V., and Wingender, E. (1999) Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J. Mol. Biol. 288, 353–376.
Keles, S., van der Laan, M., and Eisen, M. B. (2002) Identification of regulatory elements using a feature selection method. Bioinformatics 18, 1167–1175.
Keles, S., van der Laan, M. J., and Vulpe, C. (2004) Regulatory motif finding by logic regression. Bioinformatics 20, 2799–2811.
Lee, K. E., Sha, N., Dougherty, E. R., Vannucci, M., and Mallick, B. K. (2003) Gene selection: a Bayesian variable selection approach. Bioinformatics 19, 90–97.
Liu, X. S., Brutlag, D. L., and Liu, J. S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839.
Majello, B., De Luca, P., and Lania, L. (1997) Sp3 is a bifunctional transcription regulator with modular independent activation and repression domains. J. Biol. Chem. 272, 4021–4026.
Matys, V., Fricke, E., Geffers, R., et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucl. Acids Res. 31, 374–378.

NCBI (2004) National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/).
Noti, J. D. (1997) Sp3 mediates transcriptional activation of the leukocyte integrin genes CD11C and CD11B and cooperates with c-Jun to activate CD11C. J. Biol. Chem. 272, 24,038–24,045.
Pan, J. Z., Jornsten, R., and Hart, R. P. (2004) Screening anti-inflammatory compounds in injured spinal cord with microarrays: A comparison of bioinformatics analysis approaches. Physiol. Genom. 17, 201–214.
Popovich, P. G. and Jones, T. B. (2003) Manipulating neuroinflammatory reactions in the injured spinal cord: back to basics. Trends Pharmacol. Sci. 24, 13–17.
Prestridge, D. S. (1996) SIGNALSCAN 4.0: Additional databases and sequence formats. Comput. Appl. Biosci. 12, 157–160.
Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995) MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.
Rajakumar, R. A., Thamotharan, S., Menon, R. K., and Devaskar, S. U. (1998) Sp1 and Sp3 regulate transcriptional activity of the facilitative glucose transporter isoform-3 gene in mammalian neuroblasts and trophoblasts. J. Biol. Chem. 273, 27,474–27,483.
Rajakumar, R. A., Thamotharan, S., Raychaudhuri, N., Menon, R. K., and Devaskar, S. U. (2004) Trans-activators regulating neuronal glucose transporter isoform-3 gene expression in mammalian neurons. J. Biol. Chem. 279, 26,768–26,779.
RGSC (2004) Rat Genome Sequencing Consortium (http://www.hgsc.bcm.tmc.edu/projects/rat/assembly.html).
Robert, C. (1995) Simulation of truncated normal variables. Stat. Comput. 5, 121–125.
Ross, S., Tienhaara, A., Lee, M. S., Tsai, L. H., and Gill, G. (2002) GC box-binding transcription factors control the neuronal specific transcription of the cyclin-dependent kinase 5 regulator p35. J. Biol. Chem. 277, 4455–4464.
Smith, M. and Kohn, R. (1997) Nonparametric regression using Bayesian variable selection. J. Econometrics 75, 317–344.
Supp, D. M., Witte, D. P., Branford, W. W., Smith, E. P., and Potter, S. S. (1996) Sp4, a member of the Sp1-family of zinc finger transcription factors, is required for normal murine growth, viability, and male fertility. Dev. Biol. 176, 284–299.

Tadesse, M. G., Vannucci, M., and Lio, P. (2004) Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics 20, 2553–2561.
Tompa, M., Li, N., Bailey, T. L., et al. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144.
Troyanskaya, O., Cantor, M., Sherlock, G., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525.
van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842.

Yoo, J., Jeong, M. J., Kwon, B. M., Hur, M. W., Park, Y. M., and Han, M. Y. (2002) Activation of dynamin I gene expression by Sp1 and Sp3 is required for neuronal differentiation of N1E-115 cells. J. Biol. Chem. 277, 11,904–11,909.
Zhou, X., Wang, X., and Dougherty, E. R. (2003a) Binarization of microarray data based on a mixture model. Mol. Cancer Ther. 2, 679–684.
Zhou, X., Wang, X., and Dougherty, E. R. (2003b) Missing-value estimation using linear and nonlinear regression with Bayesian gene selection. Bioinformatics 19, 2302–2307.
Zhou, X., Wang, X., and Dougherty, E. R. (2004) Gene prediction using multinomial probit regression with the Bayesian variable selection. EURASIP J. Appl. Signal Proc. 3, 115–124.
