Checking Consistency Between Expression Data and Large Scale ...

1 downloads 0 Views 764KB Size Report
May 2, 2007 - aroF aroG aroP tyrA tyrB. xapR .... [10] H Salgado, S Gama-Castro, M Peralta-Gil, E Diaz-Peredo, F Sanchez-Solano, A Santos-. Zavaleta ...
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

Checking Consistency Between Expression Data and Large Scale Regulatory Networks: A Case Study Carito Guziolowski — Philippe Veber — Michel Le Borgne — Ovidiu Radulescu — Anne

N° 6190 May 2007

apport de recherche

ISRN INRIA/RR--6190--FR+ENG

Thème BIO

ISSN 0249-6399

inria-00145537, version 2 - 15 May 2007

Siegel

inria-00145537, version 2 - 15 May 2007

Checking Consistency Between Expression Data and Large Scale Regulatory Networks: A Case Study Carito Guziolowski∗ , Philippe Veber , Michel Le Borgne , Ovidiu Radulescu , Anne Siegel Th`eme BIO — Syst`emes biologiques Projet Symbiose

inria-00145537, version 2 - 15 May 2007

Rapport de recherche n 6190 — May 2007 — 14 pages

Abstract: We proposed in previous articles a qualitative approach to check the compatibility between a model of interactions and gene expression data. The purpose of the present work is to validate this me thodology on a real-size setting. We study the response of Escherichia coli regulatory network to nutritional stress, and compare it to publicly available DNA microarray experiments. We show how the incompatibilities we found reveal missing interactions in the network, as well as observations in contradiction with available literature. Key-words: Regulatory networks, microarray experiments, qualitative modeling



Contact: [email protected]

Unité de recherche INRIA Rennes IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France) Téléphone : +33 2 99 84 71 00 — Télécopie : +33 2 99 84 71 71

V´ erification de la Coh´ erence entre Donn´ ees d’expression et R´ eseaux de R´ egulation ` a Grande Echelle: Un cas d’´ etude R´ esum´ e : On a propos´e dans les articles pr´ec´edents une approche qualitative pour tester la compatibilit´e entre un mod`ele d’interactions et des donn´es d’expressions. L’objectif du pr´esent travail est de valider cette m´ethodologie avec un probl`eme de taille r´eelle. On a ´etudi´e la r´eponse du r´eseau de r´egulation de la bactrie Escherichia coli au stress nutritionnel, et on l’a compar´e avec des donn´ees de puces a ` ADN disponibles. On montre comment les incompatibilit´es trouv´ees indiquent des interactions manquantes, ainsi que des donn´ees observ´ees contradictoires avec la litt´erature disponible.

inria-00145537, version 2 - 15 May 2007

Mots-cl´ es : R´eseaux de r´egulation, puces a ` ADN, mod´elisation qualitative

Consistency Between Expression Data and Large Scale Regulatory Networks

inria-00145537, version 2 - 15 May 2007

1

3

Introduction

There exists a wide range of techniques for the analysis of gene expression data. Following a review by Slonim [13], we may classify them according to the particular output they compute: 1. list of significantly over/under-expressed genes under a particular condition, 2. dimension reduction of expression profiles for visualization, 3. clustering of co-expressed genes, 4. classification algorithms for protein function, tissue categorization, disease outcome, 5. inferred regulatory networks. The last category may be extended to all model-based approaches, where experimental measurements are used to build, verify or refine a model of the system under study. Following this line of research, we showed in previous papers (see [9], [12] and [14]) how to define and to check consistency between experimental measurements and a graphical regulatory model formalized as an interaction graph. The purpose of the present work is to validate this methodology on a real-size setting. More precisely, we show 1. that the algorithms we proposed in [14] are able to handle models with thousands of genes and reactions, 2. that our methodology is an effective strategy to extract biologically relevant information from gene expression data. For this we built an interaction graph for the regulatory network of E. coli K12, mainly relying on the highly accurate database RegulonDB [10], [11]. Then we compared the predictions of our model with three independant microarray experiments. Incompatibilities between experimental data and our model revealed: either expression data that is not consistent with results showed in literature – i.e. there is at least one publication which contradicts the experimental measurement, either missing interactions in the model We are not the first to address this issue. Actually, in the work of Gutierrez-Rios and coworkers [6], an evaluation of the consistency between literature and microarray experiments of E. coli K12 was presented. The authors designed on-purpose microarray experiments in order to measure gene expression profiles of the bacteria under different conditions. They evaluate the consistency of their experimental results first with those reported in the literature, second with a rule-based formalism they propose. Our main contribution is the use of algorithmic tools that allow inference/prediction of gene expression of a big percentage of the network, and diagnosis in the case of inconsistency between a model and expression data.

2 2.1

Mathematical framework Introductory example

We choose as an illustration a model for the lactose metabolism in the bacterium E.Coli (lactose operon). The interaction graph corresponding to the model is presented in Fig.1. This is a common representation for biochemical systems where arrows show activation

RR n 6190 

4

Guziolowski & al.

or inhibition. Basically, an arrow between A and B means that an increase of A tends to increase or decrease B depending on the shape of the arrow head. Common sense and simple biological intuition can be used to say that an increase of allolactose (node A on Figure 1) should result in a decrease of LacI protein. However, if both LacI and cAM P − CRP increase, then nothing can be said about the variation of LacY . The aim of this section is first, to provide a formal interpretation for the graphical notation used in Figure 1; second, to derive constraints on experimental measurements, which justify our small scale common sense reasoning; finally apply these constraints to the scale of data produced by high throughput experimental techniques. For this, we resort to qualitative modeling ([8]), which may be seen as a principled way to derive a discrete system from a continuous one.

inria-00145537, version 2 - 15 May 2007

2.2

Equilibrium shift of a differential system

Let us consider a network of n interacting cellular constituents (mRNA, protein, metabolite). We denote by Xi the concentration of the ith species, and by X the vector of concentrations (whose components are Xi ). We assume that the system can be adequately described by a system of differential equations of the form dX dt = F(X, P), where P denotes a set of control parameters (inputs to the system). A steady state of the system is a solution of the system of equations F(X, P) = 0 for fixed P. A typical experiment consists in applying a perturbation (change P) to the system in a given initial steady state condition eq1, wait long enough for a new steady state eq2, and record the changes of Xi . Thus, we shall interpret the sign of DNA chips differential data as the sign of the variations Xieq2 − Xieq1 . The particular form of vector function F is unknown in general, but this will not be needed as we are interested only in the signs of the variations. Indeed, the only information ∂Fi we need about F is the sign of its partial derivatives ∂X . We call interaction graph the j graph whose nodes are the constituents {1, . . . , n}, and where there is an edge j → i iff ∂Fi ∂Xj 6= 0 (an arrow j → i means that the rate of production of i depends on X j ). As soon ∂Fi ∂Xj may depend on ∂Fi ∂Xj is constant, that

as F is non linear,

the actual state X. In the following, we will assume

is, that the interaction graph is independent of the that the sign of state. This rather strong hypothesis, can be replaced by a milder one specified in [9, 12] meaning essentially that the sign of the interactions do not change on a path of intermediate states connecting the initial and the final steady states.

2.3

Qualitative constraints

In the following, we introduce an equation that relates the sign of variation of a species to that of its predecessors in the interaction graph. To state this result with full rigor, we need to introduce the following algebra on signs. We call sign algebra the set {+, , ?} (where ? stands for indeterminate), endowed with addition, multiplication and qualitative equality, defined as:

INRIA

5

Consistency Between Expression Data and Large Scale Regulatory Networks

++ ?+

=? =?

+++=+ ?++ = ?

+ = ?+?=?

+× ?×

= =?

+×+=+ ?×+=?

≈ × =+ + ?×?=? ?

+ T F T

? T T T

F T T

Some particularities of this algebra deserve to be mentioned: the sum of + and

is indeterminate, as is the sum of anything with indeterminate,

inria-00145537, version 2 - 15 May 2007

qualitative equality is reflexive, symmetric but not transitive, because ? is qualitatively equal to anything; this last property is an obstacle against the application of classical elimination methods for solving linear systems. To summarize, we consider experiments that can be modelled as an equilibrium shift of a differential system under a change of its control parameters. In this setting, DNA chips provide the sign of variation in concentration of many (but not necessarily all) species in the network. We consider the signs s(Xieq2 − Xieq1 ) of the variation of some species i between the initial state X eq1 and the final state X eq2 . Both states are stationary and unknown. In [9], we proved that under some reasonable assumptions, in particular if the sign of ∂Fi ∂Xj is constant in states along a path connecting eq1 and eq2, then the following relation holds in sign algebra for all species i: s(Xieq2 − Xieq1 ) ≈

X

j∈pred(i)

s(

∂Fi )s(Xjeq2 − Xjeq1 ) ∂Xj

(1)

where s : R → {+, } is the sign function, and where pred(i) stands for the set of predecessors of species i in the interaction graph. This relation is similar to a linearization of the system F(X, P) = 0. Note however, that as we only consider signs and not quantities, this relation is valid even for large perturbations (see [9] for a complete proof).

2.4

Analyzing a network: a simple example

Let us now describe a practical use of these results. Given an interaction graph, say for instance the graph illustrated in Figure 1, we use Equation 1 at each node of the graph to build a qualitative system of constraints. The variables of this model are the signs of variation for each species. The qualitative system associated to our lactose operon model is proposed in the right side of Figure 1. In order to take into account observations, measured variables should be replaced by their sign values. A solution of the qualitative system is defined as a valuation of its variables, which does not contain any ”?” (otherwise, the constraints would have a trivial solution with all variables set to ”?”) and that, according to the qualitative equality algebra, will satisfy all qualitative constraints in the system. If the model is correct and if data is accurate, then the qualitative system must posses at least one solution. A first step then is to check the self-consistency of the graph, that is to find if the qualitative system without observations has at least one solution. Checking consistency

RR n 6190 

6

Guziolowski & al.

                  

LacI A LacZ Li G cAM P LacY

≈ ≈ ≈ ≈ ≈ ≈ ≈

−A LacZ cAM P − LacI Le + LacY − LacZ Li + LacZ −G cAM P − LacI

(1) (2) (3) (4) (5) (6) (7)

Figure 1: Interaction graph for the lactose operon and its associated qualitative system. In the

inria-00145537, version 2 - 15 May 2007

graph, arrows ending with ”>” or ”−|” imply that the initial product activates or represses the production of the product of arrival, respectively.

between experimental measurements and an interaction graph boils down to instantiating the variables which are measured with their experimental value, and see if the resulting system still has a solution. If this is the case, then it is possible to determine if the model predicts some variations. Namely, it happens that a given variable has the same value in all solutions of the system. We call such variable a hard component. The values of the hard components are the predictions of the model. Whenever the system has no solution, a simple strategy to diagnose the problem is to isolate a minimal set of inconsistent equations. In our experiments, a greedy approach was enough to solve all inconsistencies (see next section). Note that in our setting isolating a subset of the equations is equivalent to isolating a subgraph of the interaction graph. The combination of the diagnosis algorithm and a visualization tool is particularly useful for model refinement. Finally, let us mention that we provided in [14] an efficient representation of qualitative systems, leading to effective algorithms, some of them could be used to get further insights into the model under study. We shall see in the next section, that these algorithms are able to deal with large scale networks.

3 3.1

Results Construction of the Escherichia coli regulatory network

For building E.coli regulatory network we relied on the transcriptional regulation information provided by RegulonDB ([10], [11]) on March 2006. From the file containing transcription factor to gene interactions we have built the regulatory network of E.coli as a set of interactions of the form A → B sign where sign denotes the value of the interaction: +, , ?(expressed, repressed, undetermined), and A and B can be considered as genes or proteins, depending on the following situations:

INRIA

Consistency Between Expression Data and Large Scale Regulatory Networks

7

The interaction genA → genB was created when both genA and genB are notified by RegulonDB, and when the protein A, synthesized by genA, is among the transcriptional factors that regulate genB. See Figure 2 A. The interaction T F → genB was created when we found TF as an heterodimer protein (protein-complex formed by the union of 2 proteins) that regulates genB. See Figure 2 B. In E.coli transcriptional network we have found 4 protein-complexes which are: IHF, HU, RcsB, and GatR. The interaction genA → T F was created when we found the transcriptional factor TF as an heterodimer protein and genA synthesizes one of the proteins that form TF. See Figure 2 B. A

B

IhfB

inria-00145537, version 2 - 15 May 2007

Fur fur

IHF fiu

IhfA aceA

Figure 2: Representation of genetical interactions. (A) Negative regulation (repression) of gene f iu by the transcription factor F ur represented as f ur → f iu −. (B) Biological interaction of genes ihf A and ihf B forming the protein-complex IHF represented as ihf A → IHF + and ihf B → IHF +, positive regulation of gene aceA by the protein complex IHF represented by IHF → aceA +

3.2

Adding sigma factors to obtain self-consistency

Using the methods and the algorithms described with detail in [14] we built a qualitative system of equations for the interaction graph obtained from E.coli network. For solving qualitative equations we have used our own tool, the PYTHON module PYQUALI. The system was not found to be self-consistent and we used a procedure available in PYQUALI library to isolate a minimal inconsistent subgraph (see Figure 3). A careful reading of the available literature led us to consider the regulations involving sigma factors which were initially absent from the network. Once added to complete the network, we obtained a network of 3883 interactions and 1529 components (genes, protein-complexes, and sigmafactors). This final network (global network) was found to be self-consistent.

3.3

Compatibility of a network with a set of observations

A compatible network can be tested with different sets of observations of varied stresses: thermal, nutritional, hypoxic, etc. An observation is a pair of values of the form gene = sign where sign can be + or indicating that the gene is expressed or respectively repressed

RR n 6190 

8

Guziolowski & al.

      

ihfA

ihfA

rpoD

ihfB

rpoS

IHF ≈ ihfA + ihfB (1)

     IHF

IHF

   

ihfB

ihfA ≈ −IHF + rpoS + rpoD (2) ihfB ≈ −IHF + rpoS + rpoD (3)

Figure 3: (Left) A minimal inconsistent subgraph, isolated from the whole E.coli regulatory network using PYQUALI. (Right) Correction proposed after careful reading of available literature on ihfA and ihfB regulation.

inria-00145537, version 2 - 15 May 2007

Table 1: Table of the 40 variations of products observed under stationary growth phase condition. Source: RegulonDB March 2006 gene acnA acrA adhE appB appC appY blc bolA

variation + + + + + + + +

gene csiE cspD dnaN dppA fic gabP gadA gadB

variation + + + + + + + +

gene gadC hmp hns hyaA ihfA ihfB lrp mpl

variation + + + + − − + +

gene osmB osmE osmY otsA otsB polA proP proX

variation + + + + + + + +

gene recF rob sdaA sohB treA yeiL yfiD yihI

variation + + − − + + + −

under certain condition. To test the global network of E. coli, we have chosen a set of 40 observations for the stationary phase condition provided by RegulonDB (Table 1). The set of 40 observations of the stationary phase was found to be inconsistent with the global network of E. coli. We found a direct inconsistency in the system of equations caused by the values fixed by the observations given to ihfA and ihfB: {ihf A = , ihf B = }, implying repression of these genes under stationary phase. This mathematical incompatibility agreed with the literature related to genes ihf A and ihf B expression under stationary growing phase. Studies [1],[2],[4],[15] agree that transcription of ihf A and ihf B increases during stationary phase. Supported by this information, we have modified the observations of ihf A and ihf B and the compatibility test of the global network of E.coli was successful.

3.4

Predictions over a compatible network from a set of observations

As mentioned earlier, a regulatory network is said to be consistent with a given set of observations when the associated qualitative system has at least one solution. If a variable is fixed to the same value in all solutions, then mathematically we are talking about a hard component, which is a prediction or inference for this set of observations. We have mentioned that the regulatory network including sigma factors is consistent with the set of 40 observations for stationary phase, after some correction. Actually there are about 2, 66 · 1016 solutions of the qualitative system which are consistent with the 40 observations of stationary phase. Furthermore, in all these solutions, 381 variables of the system have always the same value (they are hard components, see Figure 4). In other words, we were

INRIA

9

Consistency Between Expression Data and Large Scale Regulatory Networks

rcA

evgA

argR

argB

argC

argD

argE

argF

argH

yhbC

emrK

argI

nupC

argG

sufA

sufB

sufC

carA

sufD

carB

sufE

cspE

evgS

cspA

sufS

sucA

srlB

sucB

sucC

malZ

sucD

lamB

srlE

malE

srlD

malF

malK

soxS

cpxA

recF

malG

soxR

malM

lyxK

malT

malP

malQ

cpxP

gntR

cusR

cpxR

fldA

cusA

dsbA

rdoA

baeR

ung

nfnB

hns

putA

metC

fpr

cusC

fldB

marA

malS

cusB

inaA

acrD

nfo

cusF

nfsA

entF

cynR

cusS

ribA

mdtB

cynS

cynT

rimK

entS

cysB

cynX

ybjC

exbB

exbD

poxB

cysA

cysC

cysD

cysH

cysI

ybjN

fepB

gntK

mdtC

mdtD

baeS

marR

mdtA

fepC

acrA

mhpR

ppiA

cysJ

gntU

marB

cysN

cysP

cysU

cysW

idnR

edd

tauB

asnA

cysM

cbl

fepD

acrB

cysK

metF

tauC

metI

tauD

metK

tauA

metN

fepE

metQ

fepG

metB

metE

metJ

metL

metH

gcvR

metR

glyA

gcvB

metA

cvpA

glnB

hflD

prsA

idnK

gcvA

purB

yodA

purC

cytR

purD

fes

purE

fhuA

purF

fhuB

purH

fhuC

purK

idnO

fhuD

fumC

idnT

fhuF

zwf

mhpA

idnD

nohA

purL

glcC

betI

sraI

purM

tonB

purN

pyrC

yhhY

speA

fur

purR

speB

betB

ubiX

pyrD

mhpB

gcvH

betT

gcvP

betA

gcvT

mhpD

rhaS

galR

ahpC

mhpC

cadC

rhaR

udp

cirA

ahpF

ilvY

nrdE

codA

mhpE

cdd

nrdF

glcA

entA

glcB

nrdH

codB

glcD

nrdI

glcE

ygaC

mhpF

glcF

entB

arcA

ilvC

iscR

iscA

lexA

iscS

iscU

dinF

dinG

dnaG

glcG

fhuE

ftsK

polB

ydeA

entC

entD

entE

fepA

fiu

infB

metY

nusA

pnp

cadB

uvrC

uvrY

ydjM

dinQ

hcaR

rbfA

rpsO

galS

cadA

tisB

treB

ybdB

gyrA

sodA

truB

lacA

insK

treC

pgmA

lacY

ulaR

molR_1

glpD

phr

fumA

nupG

rhaT

lacZ

fecR

chiA

ruvA

mukB

lysR

ruvB

tsx

tisA

recA

paaJ

hcaB

hcaC

pqiA

mukE

paaK

aldA

paaI

hcaD

pqiB

recN

recX

paaH

hcaE

rpsU

paaG

hcaF

ssb

paaF

sulA

paaE

umuC

paaD

umuD

paaC

uvrA

paaB

uvrB

paaA

uvrD

gltA

mngR

lysA

moeB

moeA

oxyR

mtr

mngA

ulaA

ulaB

ulaC

ulaD

emrR

emrA

phoB

mngB

ulaE

paaX

ulaF

fucR

frdA

ptsH

asr

frdB

ptsI

b4103

frdC

phnC

frdD

phnD

crr

phnE

focA

phnF

pflB

phnG

glnA

phnH

glnL

phnI

nagA

phnJ

uxuA

dcuR

uxuB

gutM

argP

appB

trpR

phnK

fdnG

fdnH

fdnI

phnL

phnM

hyaB

hyaC

gutQ

emrB

phnN

phnO

hyaD

srlA

phnP

hyaE

nrdA

hyaF

phoA

phoE

aceB

nrdB

aceK

polA

guaA

guaB

phoH

phoR

norV

norW

fumB

psiF

hypB

dnaN

kdgR

hypC

malI

melR

rob

aroH

grxA

sgbH

trpA

fucA

sgbU

ptsG

lpdA

acs

trpB

fucI

dusB

trpC

fucK

fadL

trpD

fucO

yjcG

trpE

fucP

yjcH

tyrR

fucU

yiaO

trpL

bacA

nuoA

aroL

dsbC

nuoB

aroM

ecfI

nuoC

yaiA

htrA

nuoE

aroF

psd

nuoF

xapR

aroG

smpA

nuoG

aroP

tyrA

srlR

nuoH

nuoI

tyrB

xapA

dctA

nuoJ

nuoK

dsdC

xapB

rbsR

gntT

dnaA

nuoL

nuoM

eda

nuoN

tdcD

gntX

phoU

tdcE

gntY

agn43

tdcB

mukF

cyaA

pstA

tdcF

glgA

katG

tdcG

glgC

trxC

gltD

glgP

pstB

gltF

dsdA

pstC

gltB

psiE

malX

yeiL

birA

crp

dsdX

pstS

rpoD

ugpA

ugpB

malY

acrR

ugpC

melA

asnC

ugpE

melB

tyrP

ugpQ

rbsA

glpR

rbsB

rbsC

lacI

smtA

fnr

rbsD

rbsK

fecI

fimA

fecA

glpK

alaT

fecB

fecC

alaW

fecD

galP

alaX

fecE

argU

fimC

argW

fimD

argX

fimF

aspV

fimG

mntH

fimH

csgD

fimI

glnU

csgE

csgF

glnV

glnW

glnX

glpX

gltV

glpF

glyT

glpQ

glyU

gyrB

glpT

hisR

gntP

ileT

dps

leuP

hpt

leuQ

spf

nagB

leuT

nagD

leuV

leuW

nagE

glmS

sgbE

glmU

leuX

acnA

lysT

lysV

yiaK

trxA

agp

paaZ

dcuB

ompR

yiaL

yiaM

yiaN

pgk

glpE

yiaJ

rhaA

glpG

appY

rhaB

rhaD

b0725

lctP

lctD

ackA

nmpC

sdhA

sdhB

sdhC

sdhD

hlyE

galM

prpR

lctR

csgG

micF

norR

galT

sigma19

xylR

sodB

bglB

galE

galK

bglF

bglG

chbR

hupA

xylA

xylB

xylG

dcuC

chbC

xylH

xylF

chbF

HU

pgm

chbG

appC

chbA

chbB

lysW

mazE

appA

mazF

hemA

metT

metU

tnaA

pheU

araC

tnaB

pheV

hyfR

ilvB

tnaC

proK

ilvN

agaR

ivbL

mpl

fis

proL

proM

rrfA

rrfE

accA

rrlE

deoA

deoB

deoC

prpD

deoD

rrfB

rrfC

accC

ebgR

accD

ampC

sohB

prpC

rrlA

accB

rrfD

nagC

prpB

acnB

prpE

araA

araB

hupB

rrfF

amyA

araD

queA

rrfG

rrfH

rrlB

rpe

ompA

araE

aroK

putP

araH

aroB

speC

araF

gph

ubiG

araJ

xseA

araG

yhfA

rrlG

seqA

rrlH

rrnB

rrnC

rrnD

trpS

cpdB

arsC

arsR

arsB

narG

aldB

rrnG

adhE

rrnH

asnT

asnU

asnV

atpG

cstA

hyaA

proP

rrlD

dam

nanC

trmA

rrlC

damX

thrV

glnQ

alaU

ygjG

alaV

narH

aidB

gltT

narI

narJ

narK

slyA

gltU

atpH

ileU

ileV

atpC

fadD

uspA

treR

aceA

envZ

epd

csiR

gltW

atpD

atpE

ansB

rrnE

bolA

atpA

aspA

fbaA

rrnA

cspD

nanR

atpF

dcuA

gcd

agaA

manX

serT

serX

csiE

atpI

agaV

manY

thrT

bcp

agaW

tpr

gabP

btuB

ebgA

hybO

thrW

csgB

bglX

agaZ

manZ

thrU

csgA

atpB

cedA

ebgC

chpS

gapA

chpB

clpA

mlc

cls

nanA

cmk

nanE

aceF

fadR

nanK

rpsA

corA

creD

rpoE

creC

tyrU

gabT

tyrV

gabD

valT

ygaF

valU

csiD

valX

valY

creB

csrB

cutA

dapA

dcuD

dut

nanT

cydC

ksgA

slmA

efp

narL

icd

tyrT

creA

pdxA

ompF

serA

dadA

cydD

dadX

hybA

hybB

hybC

hybD

fimB

inria-00145537, version 2 - 15 May 2007

fliA

cheA

cheW

motA

motB

tar

trg

aer

flgK

flgN

mglA

flgM

fliD

fliS

fliT

flxA

fliL

fliM

fliN

fliO

fliP

fliQ

ycgR

yhjH

fliC

mglB

mdh

mglC

fliY

fliZ

glpA

glpB

gloA

gltX

tdcR

hybE

hybF

glyV

pdhR

hybG

ulaG

gor

glyX

narV

glyY

narW

cysT

narY

glyW

narZ

tdcA

leuZ

rimM

gmr

rplE

hypD

hemN

rplF

hypE

hepA

rplM

hycA

fdx

hscA

rplN

hycB

rplO

hycC

hscB

rplR

hycD

ileX

rplS

hycE

infC

rplT

hycF

kdsA

kdtA

rplX

hycG

rpmD

hycH

leuU

rpmJ

lysC

rpsE

hycI

mtlA

mntR

menA

rpsI

menB

rpsN

caiF

fixA

fixB

fixC

fixX

asnB

ybaS

gadX

gadB

cysG

gadC

nirB

nirC

caiA

caiB

caiC

menC

rpsP

yfiD

menE

rpsS

aceE

nikA

mtlD

hdeD

lysP

rpsH

mtlR

flhD

tsr

fxsA

metV

secY

nikR

nikB

metW

nikC

speD

tpx

nikD

metZ

speE

focB

nikE

mfd

ddlB

ftsI

mraY

murD

murE

murG

ftsL

mraZ

mraW

murC

murF

ftsW

yebC

trmD

hyfA

ruvC

narX

hyfB

hyfC

hyfD

hyfE

hyfF

hyfG

hyfH

hyfI

hyfJ

ubiA

ubiC

cyoA

hcp

cyoB

nudB

gatB

paaY

gatC

gatD

hcr

cyoC

pcnB

folK

gatY

pgi

gatZ

yhcH

fadA

cyoD

cyoE

tdcC

pheP

fadB

pntA

iclR

pntB

pth

fadI

ychF

fadJ

relA

rfaI

nrdD

rfaJ

rfaP

uxuR

uidA

uidB

nrdG

lpxA

caiE

hdeA

hdeB

hdfR

lrhA

qseB

rplB

lpxD

rfaS

rfaY

rplC

rfaZ

rplD

skp

rseB

fabZ

rfaK

rplP

rfaB

rplV

rfaG

rplW

rnb

sbmA

rnc

rpmC

acpS

sixA

pdxJ

rpsC

recO

apaH

era

rpsJ

yhbH

rpsQ

ptsN

agaB

npr

agaC

yhbJ

agaD

rplQ

agaI

rpoA

apaG

rpsD

agaS

rpsM

agaY

rpsK

yaeL

serU

ybaB

serV

yeaY

argV

yfeY

argY

yiiS

gnd

argQ

cutC

purA

argZ

ecfA

amiA

serW

ecfD

dppB

ssuC

yggN

dppC

ssuE

ecfG

ssuD

ecfH

ssuB

ecfJ

ssuA

ecfK

thrL

thrA

ecfL

thrB

fkpA

dppD

thrC

tufA

dppF

upp

fusA

hemF

uraA

surA

fruA

valV

lpxP

fruB

valW

rseC

fruK

ves

rseA

pckA

xseB

nlpB

pfkA

dxs

imp

ppsA

yajO

ispA

mdoG

pykF

napA

yebB

rihB

ychA

ychQ

yjaB

hflK

hfq

yjeF

mdoH

napB

yjeE

uhpT

fdhF

torA

ndh

nirD

caiD

rfaQ

torC

torD

napC

napD

napF

miaA

uxaC

napG

exuT

napH

hflX

hflC

uxaB

amiB

uxaA

dmsA

mutL

uhpA

glgS

dmsB

hmp

dmsC

ccmA

ccmB

talA

gatA

ccmC

ccmD

fhlA

alkA

ccmE

ccmF

ccmG

ada

alkB

ccmH

alsA

potF

kbl

cydA

exuR

gdhA

alsB

astC

oppA

astD

alsC

astE

alsR

alsE

astA

lsrR

alsI

potG

glnG

potH

potI

astB

nac

argT

oppB

ddpA

oppC

fabA

ddpB

ddpC

oppD

pspF

ddpD

ddpF

oppF

ddpX

hisJ

tdh

hisM

lsrA

hisP

ilvH

ilvI

hisQ

lsrB

ycdG

livF

lsrC

ycdH

livG

rpoN

lsrD

lsrF

lsrG

ycdI

fabB

ycdJ

livH

topA

ycdK

livJ

ycdL

livK

cydB

livM

ycdM

lysU

modB

yeaG

yeaH

sdaA

modC

yhdW

mgtA

yhdX

uidR

yhdY

yhdZ

amtB

glnK

flhC

narP

gltI

lrp

leuA

gadW

fimE

caiT

glpC

modA

stpA

glnH

leuB

nhaA

ydeO

leuC

proV

leuD

nhaR

aroA

glnP

serC

gltJ

dppA

gltL

pspA

deoR

pspB

torR

pspC

gltK

pspD

nrfA

hypF

nrfB

nrfC

nrfD

nrfE

nrfF

RR n 6190 

gene cpxR crp cusR cynR cysB cytR dnaA dsdC evgA

variation + + + + + + + + +

gene fucR fur galR gcvA glcC gntR ilvY iscR lexA

variation + + + + + + + + +

gene lysR melR mngR oxyR phoB prpR rbsR rhaR rpoD

variation + + + + + + + + +

gene rpoS soxR soxS srlR trpR tyrR

hypA

zraP

zraR

gadE

moaE

cbpA

moaA

tufB

cfa

moaB

narU

moaC

rsd

moaD

otsB

copA

cueO

fabR

rfaC

rfaL

rfaF

leuO

ilvA

ilvD

ilvE

gadA

ilvG_1

atoE

atoC

atoD

bioF

atoB

bioA

hldD

bioB

wcaB

wza

wzb

bioC

ilvG_2

ilvL

ilvM

xthA

IHF

ihfB

ecpD

htrE

osmY

osmE

rtcA

alpA

slp

kdpE

bioD

atoA

lon

nrfG

variation + + + + + +

zraS

yfhK

otsA

rcsA

able to predict the variation: expressed (+) or repressed ( ) of 381 components of our network (25% of the products of the network). We provide a subset of these predictions in Table 2.

variation + + + + + + + + +

puuP

ihfA

proW

fliR

Table 2: Table of 42 products inferred under stationary phase condition.

pspG

ybhK

cueR

ompC

Figure 4: Global E.coli regulatory network with transcriptional and sigma-factors interactions (3883 interactions and 1529 products). Blue and red interactions represent activation or, respectively, repression. Green and blue nodes correspond to positive and negative observations (40). Red nodes (381) are the total inferred variations of products under stationary growth phase condition.

gene IHF ada agaR alsR araC argP argR baeR cadC

pspE

yaiS

leuL

wcaA

hydN

kch

envY

rpoS

proX

hflB

rtcB

rrmJ

ppiD

ybjP

yehW

yehY

yehZ

fruR

yehX

yggE

rtcR

ibpB

yiaG

ldcC

fic

treA

blc

ftsQ

ansP

clpB

clpX

clpP

dnaJ

dpbA

dnaK

gadY

groS

katE

grpE

msyB

hslV

pfkB

hslU

tktB

htgA

uspB

htpG

modE

htpX

rpoH

htrC

ibpA

phoP

pphA

kdpA

borD

hemL

kdpB

mgrB

nadR

kdpC

pagP

pncB

phoQ

nadB

rstA

zntR

zur

gatR_2

gatR_1

zntA

znuC

znuB

GatR

rstB

slyB

ybjG

yrbL

glk

adiY

adiA

rcsB

RcsB

osmC

wzc

bdm

sra

osmB

ftsA

ftsZ

sdiA

yihI

allA

allR

allB

allS

gcl

allC

allD

ylbA

glxK

dhaR

glxR

hyi

ybbV

ybbW

ybbY

dhaK

dhaL

sdaR

dhaM

garD

garK

garL

garP

garR

gudD

gudP

gudX

rnpB

10

3.5

Guziolowski & al.

Validation of the predicted genes

In order to verify whether the 381 predictions obtained from stationary phase data were valid, we have compared them with three sets of microarray data related to the expression of genes of E.Coli during stationary phase. The result obtained is showed in Table 3. The number of compared genes corresponds to the common genes, the validated genes are those genes which variation in the prediction is the same as in the microarray data set. Table 3: Validation of the prediction with microarray data sets Source of microarray data

inria-00145537, version 2 - 15 May 2007

Gutierrez-Rios and co-workers [6], stationary phase Gene Expression Omnibus ([3],[5]), stationary phase after 20 minutes Gene Expression Omnibus ([3],[5]), stationary phase after 60 minutes

Compared genes 249 292 281

Validated genes (%) 34% 51.71% 51.2%

From the sets of microarray data provided by GEO (Gene Expression Omnibus) for stationary phase measured after 20 and 60 minutes, we have taken into account gene expressions whose absolute value is above a specific threshold and compared only these expression data with the 381 predictions. The percentage of validation obtained for different values of thresholds is illustrated in Figure 5. This percentage increases with the threshold, which is normal because stronger variations are more reliable.

Figure 5: (Left) Percentage of validation of the 381 predicted variations of genes with microarray data sets from GEO (Gene Expression Omnibus) for stationary phase after 20 and 60 minutes. For both experiments we validate the 381 predictions with different sets of microarray observations considering only those genes which absolute value of expression is above certain value (threshold). (Right) Number of genes considered for the validation for the different used thresholds of both microarray data sets.

INRIA

Consistency Between Expression Data and Large Scale Regulatory Networks

11

The percentage of our predictions that does not agree with the microarray results is due to: Erroneous microarray indications for certain genes. The genes xthA, cf a, cpxA, cpxR, gor are predicted as expressed by our model and as repressed by the microarray data [6]. Nevertheless, there is strong evidence that they are expressed during the stationary phase (see [7, 16]). Incompleteness of our network model. Our model predicts that the gene ilvC is expressed, which contradicts microarray data. More careful studies [17] document the decrease of the protein IlvC due to an interaction with clpP which is absent in our model. Indeed, under the introduction of a negative interaction between these species, ilvC is no longer a hard component, which lifts the conflict with data.

inria-00145537, version 2 - 15 May 2007

4

Conclusions

Given an interaction graph of a thousand products, such as E.coli regulatory network, we were able to test its self-consistency and its consistency with respect to observations. We have used mathematical methods first exposed in [9, 12, 14]. We have found that the E.coli transcriptional regulatory network, obtained from RegulonDB site [10],[11] is not self consistent, but can be made self-consistent by adding to it sigma-factors which are transcription initiation factors. The self-consistent network (including sigma-factors) is not consistent with data provided by RegulonDB for the stationary growth phase of E.coli. Sources of inconsistency were mistaken observations. Finally, a step of inference/prediction was achieved being able to infer 381 new variations of products (25% of the total products of the network) from E.coli global network (transcriptional plus sigma-factors interactions). This inference was validated with microarray results, obtaining in the best case that 40% of the inferred variations were consistent (37% were not consistent and 23% of them could not be associated to a microarray measure). We have used our approach to spot several imprecisions in the microarray data and missing interactions in our model. This approach can be used in order to increase the consistency between network models and data, which is important for model refinement. Also, it may serve to increase the reliability of the data sets. We plan to use this approach to test different experimental conditions over E.coli network in order to complete its interaction network model. It should be also interesting to test it with different (signed and oriented) regulatory networks. All the tools provided to arrive to these results were packaged in a Python library called PYQUALI which will be soon publicly available. All scripts and data used in this article are available upon simple request to the authors.

RR n 6190 

12

Guziolowski & al.

References [1] T Ali Azam, A Iwata, A Nishimura, S Ueda, and A Ishihama. Growth PhaseDependent Variation in Protein Composition of the Escherichia coli Nucleoid. J Bacteriol, 181(20):6361–6370, August 1991. [2] M Aviv, H Giladi, G Schreiber, AB Oppenheim, and G Glaser. Expression of the genes coding for the Escherichia coli integration host factor are controlled by growth phase, rpoS, ppGpp and by autoregulation. Mol Microbiol, 14(5):1021–1031, Dec 1994. [3] T Barrett, T O Suzek, D B Troup, S E Wilhite, W C Ngau, P Ledoux, D Rudnev, A E Lash, W Fujibuchi, and R Edgar. NCBI GEO: mining millions of expression profiles-database and tools. Nucleic Acids Res, 33:D562–6, Jan 2005.

inria-00145537, version 2 - 15 May 2007

[4] M D Ditto, D Roberts, and R A Weisberg. Growth phase variation of integration host factor level in Escherichia coli. J Bacteriol, 176(12):3738–3748, June 1994. [5] R Edgar, M Domrachev, and A E Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 30(1):207–10, Jan 2002. [6] Rosa Maria Gutierrez-Rios, David A Rosenblueth, Jose Antonio Loza, Araceli M Huerta, Jeremy D Glasner, Fred R Blattner, and Julio Collado-Vides. Regulatory network of Escherichia coli: consistency between literature knowledge and microarray profiles. Genome Res, 13(11):2435–2443, Nov 2003. Evaluation Studies. [7] A Ishihama. Functional modulation of Escherichia coli RNA polymerase. Annu Rev Microbiol, 54:499–518, 2000. [8] B Kuipers. Qualitative reasoning. MIT Press, 1994. [9] Ovidiu Radulescu, Sandrine Lagarrigue, Anne Siegel, Philippe Veber, and Michel Le Borgne. Topology and static response of interaction networks in molecular biology. J R Soc Interface, 3(6):185–196, Feb 2006. [10] H Salgado, S Gama-Castro, M Peralta-Gil, E Diaz-Peredo, F Sanchez-Solano, A SantosZavaleta, I Martinez-Flores, V Jimenez-Jacinto, C Bonavides-Martinez, J SeguraSalazar, A Martinez-Antonio, and J Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res, 1(34):D394–7, Jan 2006. [11] H Salgado, A Santos-Zavaleta, S Gama-Castro, M Peralta-Gil, MI Penaloza-Spinola, A Martinez-Antonio, P D Karp, and J Collado-Vides. The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics, 7(1):5, Jan 2006.

INRIA

Consistency Between Expression Data and Large Scale Regulatory Networks

13

[12] A Siegel, O Radulescu, M Le Borgne, P Veber, J Ouy, and S Lagarrigue. Qualitative analysis of the relation between DNA microarray data and behavioral models of regulation networks. Biosystems, 84(2):153–174, May 2006. [13] Donna K Slonim. From patterns to pathways: gene expression data analysis comes of age. Nat Genet, 32 Suppl:502–508, Dec 2002. [14] Philippe Veber, Michel Le Borgne, Anne Siegel, Sandrine Lagarrigue, and Ovidiu Radulescu. Complex qualitative models in biology: A new approach. Complexus, 2(34):140–151, 2004. [15] A Weglenska, B Jacob, and A Sirko. Transcriptional pattern of Escherichia coli ihfB (himD) gene expression. Gene, 181(1-2):85–8, Nov 1996.

inria-00145537, version 2 - 15 May 2007

[16] D Weichart, R Lange, N Henneberg, and R Hengge-Aronis. Identification and characterization of stationary phase-inducible genes in Escherichia coli. Mol Microbiol, 10(2):405–20, Oct 1993. [17] D Weichart, Querfurth N, Dreger M, and Hengge-Aronis R. Global role for ClpPcontaining proteases in stationary-phase adaptation of Escherichia coli. J Bacteriol, 185(1):115–125, Jan 2003.

RR n 6190 

14

Guziolowski & al.

Contents 1 Introduction

inria-00145537, version 2 - 15 May 2007

2 Mathematical framework 2.1 Introductory example . . . . . . . . . . 2.2 Equilibrium shift of a differential system 2.3 Qualitative constraints . . . . . . . . . . 2.4 Analyzing a network: a simple example

3 . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

3 3 4 4 5

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

3 Results 3.1 Construction of the Escherichia coli regulatory network . . . . . 3.2 Adding sigma factors to obtain self-consistency . . . . . . . . . . 3.3 Compatibility of a network with a set of observations . . . . . . . 3.4 Predictions over a compatible network from a set of observations 3.5 Validation of the predicted genes . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

6 . 6 . 7 . 7 . 8 . 10

4 Conclusions

. . . .

11

INRIA

inria-00145537, version 2 - 15 May 2007

Unité de recherche INRIA Rennes IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France) Unité de recherche INRIA Futurs : Parc Club Orsay Université - ZAC des Vignes 4, rue Jacques Monod - 91893 ORSAY Cedex (France) Unité de recherche INRIA Lorraine : LORIA, Technopôle de Nancy-Brabois - Campus scientifique 615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex (France) Unité de recherche INRIA Rhône-Alpes : 655, avenue de l’Europe - 38334 Montbonnot Saint-Ismier (France) Unité de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France) Unité de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)

Éditeur INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)

  

  

ISSN 0249-6399

Suggest Documents