INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
Checking Consistency Between Expression Data and Large Scale Regulatory Networks: A Case Study Carito Guziolowski — Philippe Veber — Michel Le Borgne — Ovidiu Radulescu — Anne
N° 6190 May 2007
apport de recherche
ISRN INRIA/RR--6190--FR+ENG
Thème BIO
ISSN 0249-6399
inria-00145537, version 2 - 15 May 2007
Siegel
inria-00145537, version 2 - 15 May 2007
Checking Consistency Between Expression Data and Large Scale Regulatory Networks: A Case Study Carito Guziolowski∗ , Philippe Veber , Michel Le Borgne , Ovidiu Radulescu , Anne Siegel Th`eme BIO — Syst`emes biologiques Projet Symbiose
inria-00145537, version 2 - 15 May 2007
Rapport de recherche n 6190 — May 2007 — 14 pages
Abstract: We proposed in previous articles a qualitative approach to check the compatibility between a model of interactions and gene expression data. The purpose of the present work is to validate this me thodology on a real-size setting. We study the response of Escherichia coli regulatory network to nutritional stress, and compare it to publicly available DNA microarray experiments. We show how the incompatibilities we found reveal missing interactions in the network, as well as observations in contradiction with available literature. Key-words: Regulatory networks, microarray experiments, qualitative modeling
∗
Contact:
[email protected]
Unité de recherche INRIA Rennes IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France) Téléphone : +33 2 99 84 71 00 — Télécopie : +33 2 99 84 71 71
V´ erification de la Coh´ erence entre Donn´ ees d’expression et R´ eseaux de R´ egulation ` a Grande Echelle: Un cas d’´ etude R´ esum´ e : On a propos´e dans les articles pr´ec´edents une approche qualitative pour tester la compatibilit´e entre un mod`ele d’interactions et des donn´es d’expressions. L’objectif du pr´esent travail est de valider cette m´ethodologie avec un probl`eme de taille r´eelle. On a ´etudi´e la r´eponse du r´eseau de r´egulation de la bactrie Escherichia coli au stress nutritionnel, et on l’a compar´e avec des donn´ees de puces a ` ADN disponibles. On montre comment les incompatibilit´es trouv´ees indiquent des interactions manquantes, ainsi que des donn´ees observ´ees contradictoires avec la litt´erature disponible.
inria-00145537, version 2 - 15 May 2007
Mots-cl´ es : R´eseaux de r´egulation, puces a ` ADN, mod´elisation qualitative
Consistency Between Expression Data and Large Scale Regulatory Networks
inria-00145537, version 2 - 15 May 2007
1
3
Introduction
There exists a wide range of techniques for the analysis of gene expression data. Following a review by Slonim [13], we may classify them according to the particular output they compute: 1. list of significantly over/under-expressed genes under a particular condition, 2. dimension reduction of expression profiles for visualization, 3. clustering of co-expressed genes, 4. classification algorithms for protein function, tissue categorization, disease outcome, 5. inferred regulatory networks. The last category may be extended to all model-based approaches, where experimental measurements are used to build, verify or refine a model of the system under study. Following this line of research, we showed in previous papers (see [9], [12] and [14]) how to define and to check consistency between experimental measurements and a graphical regulatory model formalized as an interaction graph. The purpose of the present work is to validate this methodology on a real-size setting. More precisely, we show 1. that the algorithms we proposed in [14] are able to handle models with thousands of genes and reactions, 2. that our methodology is an effective strategy to extract biologically relevant information from gene expression data. For this we built an interaction graph for the regulatory network of E. coli K12, mainly relying on the highly accurate database RegulonDB [10], [11]. Then we compared the predictions of our model with three independant microarray experiments. Incompatibilities between experimental data and our model revealed: either expression data that is not consistent with results showed in literature – i.e. there is at least one publication which contradicts the experimental measurement, either missing interactions in the model We are not the first to address this issue. Actually, in the work of Gutierrez-Rios and coworkers [6], an evaluation of the consistency between literature and microarray experiments of E. coli K12 was presented. The authors designed on-purpose microarray experiments in order to measure gene expression profiles of the bacteria under different conditions. They evaluate the consistency of their experimental results first with those reported in the literature, second with a rule-based formalism they propose. Our main contribution is the use of algorithmic tools that allow inference/prediction of gene expression of a big percentage of the network, and diagnosis in the case of inconsistency between a model and expression data.
2 2.1
Mathematical framework Introductory example
We choose as an illustration a model for the lactose metabolism in the bacterium E.Coli (lactose operon). The interaction graph corresponding to the model is presented in Fig.1. This is a common representation for biochemical systems where arrows show activation
RR n 6190
4
Guziolowski & al.
or inhibition. Basically, an arrow between A and B means that an increase of A tends to increase or decrease B depending on the shape of the arrow head. Common sense and simple biological intuition can be used to say that an increase of allolactose (node A on Figure 1) should result in a decrease of LacI protein. However, if both LacI and cAM P − CRP increase, then nothing can be said about the variation of LacY . The aim of this section is first, to provide a formal interpretation for the graphical notation used in Figure 1; second, to derive constraints on experimental measurements, which justify our small scale common sense reasoning; finally apply these constraints to the scale of data produced by high throughput experimental techniques. For this, we resort to qualitative modeling ([8]), which may be seen as a principled way to derive a discrete system from a continuous one.
inria-00145537, version 2 - 15 May 2007
2.2
Equilibrium shift of a differential system
Let us consider a network of n interacting cellular constituents (mRNA, protein, metabolite). We denote by Xi the concentration of the ith species, and by X the vector of concentrations (whose components are Xi ). We assume that the system can be adequately described by a system of differential equations of the form dX dt = F(X, P), where P denotes a set of control parameters (inputs to the system). A steady state of the system is a solution of the system of equations F(X, P) = 0 for fixed P. A typical experiment consists in applying a perturbation (change P) to the system in a given initial steady state condition eq1, wait long enough for a new steady state eq2, and record the changes of Xi . Thus, we shall interpret the sign of DNA chips differential data as the sign of the variations Xieq2 − Xieq1 . The particular form of vector function F is unknown in general, but this will not be needed as we are interested only in the signs of the variations. Indeed, the only information ∂Fi we need about F is the sign of its partial derivatives ∂X . We call interaction graph the j graph whose nodes are the constituents {1, . . . , n}, and where there is an edge j → i iff ∂Fi ∂Xj 6= 0 (an arrow j → i means that the rate of production of i depends on X j ). As soon ∂Fi ∂Xj may depend on ∂Fi ∂Xj is constant, that
as F is non linear,
the actual state X. In the following, we will assume
is, that the interaction graph is independent of the that the sign of state. This rather strong hypothesis, can be replaced by a milder one specified in [9, 12] meaning essentially that the sign of the interactions do not change on a path of intermediate states connecting the initial and the final steady states.
2.3
Qualitative constraints
In the following, we introduce an equation that relates the sign of variation of a species to that of its predecessors in the interaction graph. To state this result with full rigor, we need to introduce the following algebra on signs. We call sign algebra the set {+, , ?} (where ? stands for indeterminate), endowed with addition, multiplication and qualitative equality, defined as:
INRIA
5
Consistency Between Expression Data and Large Scale Regulatory Networks
++ ?+
=? =?
+++=+ ?++ = ?
+ = ?+?=?
+× ?×
= =?
+×+=+ ?×+=?
≈ × =+ + ?×?=? ?
+ T F T
? T T T
F T T
Some particularities of this algebra deserve to be mentioned: the sum of + and
is indeterminate, as is the sum of anything with indeterminate,
inria-00145537, version 2 - 15 May 2007
qualitative equality is reflexive, symmetric but not transitive, because ? is qualitatively equal to anything; this last property is an obstacle against the application of classical elimination methods for solving linear systems. To summarize, we consider experiments that can be modelled as an equilibrium shift of a differential system under a change of its control parameters. In this setting, DNA chips provide the sign of variation in concentration of many (but not necessarily all) species in the network. We consider the signs s(Xieq2 − Xieq1 ) of the variation of some species i between the initial state X eq1 and the final state X eq2 . Both states are stationary and unknown. In [9], we proved that under some reasonable assumptions, in particular if the sign of ∂Fi ∂Xj is constant in states along a path connecting eq1 and eq2, then the following relation holds in sign algebra for all species i: s(Xieq2 − Xieq1 ) ≈
X
j∈pred(i)
s(
∂Fi )s(Xjeq2 − Xjeq1 ) ∂Xj
(1)
where s : R → {+, } is the sign function, and where pred(i) stands for the set of predecessors of species i in the interaction graph. This relation is similar to a linearization of the system F(X, P) = 0. Note however, that as we only consider signs and not quantities, this relation is valid even for large perturbations (see [9] for a complete proof).
2.4
Analyzing a network: a simple example
Let us now describe a practical use of these results. Given an interaction graph, say for instance the graph illustrated in Figure 1, we use Equation 1 at each node of the graph to build a qualitative system of constraints. The variables of this model are the signs of variation for each species. The qualitative system associated to our lactose operon model is proposed in the right side of Figure 1. In order to take into account observations, measured variables should be replaced by their sign values. A solution of the qualitative system is defined as a valuation of its variables, which does not contain any ”?” (otherwise, the constraints would have a trivial solution with all variables set to ”?”) and that, according to the qualitative equality algebra, will satisfy all qualitative constraints in the system. If the model is correct and if data is accurate, then the qualitative system must posses at least one solution. A first step then is to check the self-consistency of the graph, that is to find if the qualitative system without observations has at least one solution. Checking consistency
RR n 6190
6
Guziolowski & al.
LacI A LacZ Li G cAM P LacY
≈ ≈ ≈ ≈ ≈ ≈ ≈
−A LacZ cAM P − LacI Le + LacY − LacZ Li + LacZ −G cAM P − LacI
(1) (2) (3) (4) (5) (6) (7)
Figure 1: Interaction graph for the lactose operon and its associated qualitative system. In the
inria-00145537, version 2 - 15 May 2007
graph, arrows ending with ”>” or ”−|” imply that the initial product activates or represses the production of the product of arrival, respectively.
between experimental measurements and an interaction graph boils down to instantiating the variables which are measured with their experimental value, and see if the resulting system still has a solution. If this is the case, then it is possible to determine if the model predicts some variations. Namely, it happens that a given variable has the same value in all solutions of the system. We call such variable a hard component. The values of the hard components are the predictions of the model. Whenever the system has no solution, a simple strategy to diagnose the problem is to isolate a minimal set of inconsistent equations. In our experiments, a greedy approach was enough to solve all inconsistencies (see next section). Note that in our setting isolating a subset of the equations is equivalent to isolating a subgraph of the interaction graph. The combination of the diagnosis algorithm and a visualization tool is particularly useful for model refinement. Finally, let us mention that we provided in [14] an efficient representation of qualitative systems, leading to effective algorithms, some of them could be used to get further insights into the model under study. We shall see in the next section, that these algorithms are able to deal with large scale networks.
3 3.1
Results Construction of the Escherichia coli regulatory network
For building E.coli regulatory network we relied on the transcriptional regulation information provided by RegulonDB ([10], [11]) on March 2006. From the file containing transcription factor to gene interactions we have built the regulatory network of E.coli as a set of interactions of the form A → B sign where sign denotes the value of the interaction: +, , ?(expressed, repressed, undetermined), and A and B can be considered as genes or proteins, depending on the following situations:
INRIA
Consistency Between Expression Data and Large Scale Regulatory Networks
7
The interaction genA → genB was created when both genA and genB are notified by RegulonDB, and when the protein A, synthesized by genA, is among the transcriptional factors that regulate genB. See Figure 2 A. The interaction T F → genB was created when we found TF as an heterodimer protein (protein-complex formed by the union of 2 proteins) that regulates genB. See Figure 2 B. In E.coli transcriptional network we have found 4 protein-complexes which are: IHF, HU, RcsB, and GatR. The interaction genA → T F was created when we found the transcriptional factor TF as an heterodimer protein and genA synthesizes one of the proteins that form TF. See Figure 2 B. A
B
IhfB
inria-00145537, version 2 - 15 May 2007
Fur fur
IHF fiu
IhfA aceA
Figure 2: Representation of genetical interactions. (A) Negative regulation (repression) of gene f iu by the transcription factor F ur represented as f ur → f iu −. (B) Biological interaction of genes ihf A and ihf B forming the protein-complex IHF represented as ihf A → IHF + and ihf B → IHF +, positive regulation of gene aceA by the protein complex IHF represented by IHF → aceA +
3.2
Adding sigma factors to obtain self-consistency
Using the methods and the algorithms described with detail in [14] we built a qualitative system of equations for the interaction graph obtained from E.coli network. For solving qualitative equations we have used our own tool, the PYTHON module PYQUALI. The system was not found to be self-consistent and we used a procedure available in PYQUALI library to isolate a minimal inconsistent subgraph (see Figure 3). A careful reading of the available literature led us to consider the regulations involving sigma factors which were initially absent from the network. Once added to complete the network, we obtained a network of 3883 interactions and 1529 components (genes, protein-complexes, and sigmafactors). This final network (global network) was found to be self-consistent.
3.3
Compatibility of a network with a set of observations
A compatible network can be tested with different sets of observations of varied stresses: thermal, nutritional, hypoxic, etc. An observation is a pair of values of the form gene = sign where sign can be + or indicating that the gene is expressed or respectively repressed
RR n 6190
8
Guziolowski & al.
ihfA
ihfA
rpoD
ihfB
rpoS
IHF ≈ ihfA + ihfB (1)
IHF
IHF
ihfB
ihfA ≈ −IHF + rpoS + rpoD (2) ihfB ≈ −IHF + rpoS + rpoD (3)
Figure 3: (Left) A minimal inconsistent subgraph, isolated from the whole E.coli regulatory network using PYQUALI. (Right) Correction proposed after careful reading of available literature on ihfA and ihfB regulation.
inria-00145537, version 2 - 15 May 2007
Table 1: Table of the 40 variations of products observed under stationary growth phase condition. Source: RegulonDB March 2006 gene acnA acrA adhE appB appC appY blc bolA
variation + + + + + + + +
gene csiE cspD dnaN dppA fic gabP gadA gadB
variation + + + + + + + +
gene gadC hmp hns hyaA ihfA ihfB lrp mpl
variation + + + + − − + +
gene osmB osmE osmY otsA otsB polA proP proX
variation + + + + + + + +
gene recF rob sdaA sohB treA yeiL yfiD yihI
variation + + − − + + + −
under certain condition. To test the global network of E. coli, we have chosen a set of 40 observations for the stationary phase condition provided by RegulonDB (Table 1). The set of 40 observations of the stationary phase was found to be inconsistent with the global network of E. coli. We found a direct inconsistency in the system of equations caused by the values fixed by the observations given to ihfA and ihfB: {ihf A = , ihf B = }, implying repression of these genes under stationary phase. This mathematical incompatibility agreed with the literature related to genes ihf A and ihf B expression under stationary growing phase. Studies [1],[2],[4],[15] agree that transcription of ihf A and ihf B increases during stationary phase. Supported by this information, we have modified the observations of ihf A and ihf B and the compatibility test of the global network of E.coli was successful.
3.4
Predictions over a compatible network from a set of observations
As mentioned earlier, a regulatory network is said to be consistent with a given set of observations when the associated qualitative system has at least one solution. If a variable is fixed to the same value in all solutions, then mathematically we are talking about a hard component, which is a prediction or inference for this set of observations. We have mentioned that the regulatory network including sigma factors is consistent with the set of 40 observations for stationary phase, after some correction. Actually there are about 2, 66 · 1016 solutions of the qualitative system which are consistent with the 40 observations of stationary phase. Furthermore, in all these solutions, 381 variables of the system have always the same value (they are hard components, see Figure 4). In other words, we were
INRIA
9
Consistency Between Expression Data and Large Scale Regulatory Networks
rcA
evgA
argR
argB
argC
argD
argE
argF
argH
yhbC
emrK
argI
nupC
argG
sufA
sufB
sufC
carA
sufD
carB
sufE
cspE
evgS
cspA
sufS
sucA
srlB
sucB
sucC
malZ
sucD
lamB
srlE
malE
srlD
malF
malK
soxS
cpxA
recF
malG
soxR
malM
lyxK
malT
malP
malQ
cpxP
gntR
cusR
cpxR
fldA
cusA
dsbA
rdoA
baeR
ung
nfnB
hns
putA
metC
fpr
cusC
fldB
marA
malS
cusB
inaA
acrD
nfo
cusF
nfsA
entF
cynR
cusS
ribA
mdtB
cynS
cynT
rimK
entS
cysB
cynX
ybjC
exbB
exbD
poxB
cysA
cysC
cysD
cysH
cysI
ybjN
fepB
gntK
mdtC
mdtD
baeS
marR
mdtA
fepC
acrA
mhpR
ppiA
cysJ
gntU
marB
cysN
cysP
cysU
cysW
idnR
edd
tauB
asnA
cysM
cbl
fepD
acrB
cysK
metF
tauC
metI
tauD
metK
tauA
metN
fepE
metQ
fepG
metB
metE
metJ
metL
metH
gcvR
metR
glyA
gcvB
metA
cvpA
glnB
hflD
prsA
idnK
gcvA
purB
yodA
purC
cytR
purD
fes
purE
fhuA
purF
fhuB
purH
fhuC
purK
idnO
fhuD
fumC
idnT
fhuF
zwf
mhpA
idnD
nohA
purL
glcC
betI
sraI
purM
tonB
purN
pyrC
yhhY
speA
fur
purR
speB
betB
ubiX
pyrD
mhpB
gcvH
betT
gcvP
betA
gcvT
mhpD
rhaS
galR
ahpC
mhpC
cadC
rhaR
udp
cirA
ahpF
ilvY
nrdE
codA
mhpE
cdd
nrdF
glcA
entA
glcB
nrdH
codB
glcD
nrdI
glcE
ygaC
mhpF
glcF
entB
arcA
ilvC
iscR
iscA
lexA
iscS
iscU
dinF
dinG
dnaG
glcG
fhuE
ftsK
polB
ydeA
entC
entD
entE
fepA
fiu
infB
metY
nusA
pnp
cadB
uvrC
uvrY
ydjM
dinQ
hcaR
rbfA
rpsO
galS
cadA
tisB
treB
ybdB
gyrA
sodA
truB
lacA
insK
treC
pgmA
lacY
ulaR
molR_1
glpD
phr
fumA
nupG
rhaT
lacZ
fecR
chiA
ruvA
mukB
lysR
ruvB
tsx
tisA
recA
paaJ
hcaB
hcaC
pqiA
mukE
paaK
aldA
paaI
hcaD
pqiB
recN
recX
paaH
hcaE
rpsU
paaG
hcaF
ssb
paaF
sulA
paaE
umuC
paaD
umuD
paaC
uvrA
paaB
uvrB
paaA
uvrD
gltA
mngR
lysA
moeB
moeA
oxyR
mtr
mngA
ulaA
ulaB
ulaC
ulaD
emrR
emrA
phoB
mngB
ulaE
paaX
ulaF
fucR
frdA
ptsH
asr
frdB
ptsI
b4103
frdC
phnC
frdD
phnD
crr
phnE
focA
phnF
pflB
phnG
glnA
phnH
glnL
phnI
nagA
phnJ
uxuA
dcuR
uxuB
gutM
argP
appB
trpR
phnK
fdnG
fdnH
fdnI
phnL
phnM
hyaB
hyaC
gutQ
emrB
phnN
phnO
hyaD
srlA
phnP
hyaE
nrdA
hyaF
phoA
phoE
aceB
nrdB
aceK
polA
guaA
guaB
phoH
phoR
norV
norW
fumB
psiF
hypB
dnaN
kdgR
hypC
malI
melR
rob
aroH
grxA
sgbH
trpA
fucA
sgbU
ptsG
lpdA
acs
trpB
fucI
dusB
trpC
fucK
fadL
trpD
fucO
yjcG
trpE
fucP
yjcH
tyrR
fucU
yiaO
trpL
bacA
nuoA
aroL
dsbC
nuoB
aroM
ecfI
nuoC
yaiA
htrA
nuoE
aroF
psd
nuoF
xapR
aroG
smpA
nuoG
aroP
tyrA
srlR
nuoH
nuoI
tyrB
xapA
dctA
nuoJ
nuoK
dsdC
xapB
rbsR
gntT
dnaA
nuoL
nuoM
eda
nuoN
tdcD
gntX
phoU
tdcE
gntY
agn43
tdcB
mukF
cyaA
pstA
tdcF
glgA
katG
tdcG
glgC
trxC
gltD
glgP
pstB
gltF
dsdA
pstC
gltB
psiE
malX
yeiL
birA
crp
dsdX
pstS
rpoD
ugpA
ugpB
malY
acrR
ugpC
melA
asnC
ugpE
melB
tyrP
ugpQ
rbsA
glpR
rbsB
rbsC
lacI
smtA
fnr
rbsD
rbsK
fecI
fimA
fecA
glpK
alaT
fecB
fecC
alaW
fecD
galP
alaX
fecE
argU
fimC
argW
fimD
argX
fimF
aspV
fimG
mntH
fimH
csgD
fimI
glnU
csgE
csgF
glnV
glnW
glnX
glpX
gltV
glpF
glyT
glpQ
glyU
gyrB
glpT
hisR
gntP
ileT
dps
leuP
hpt
leuQ
spf
nagB
leuT
nagD
leuV
leuW
nagE
glmS
sgbE
glmU
leuX
acnA
lysT
lysV
yiaK
trxA
agp
paaZ
dcuB
ompR
yiaL
yiaM
yiaN
pgk
glpE
yiaJ
rhaA
glpG
appY
rhaB
rhaD
b0725
lctP
lctD
ackA
nmpC
sdhA
sdhB
sdhC
sdhD
hlyE
galM
prpR
lctR
csgG
micF
norR
galT
sigma19
xylR
sodB
bglB
galE
galK
bglF
bglG
chbR
hupA
xylA
xylB
xylG
dcuC
chbC
xylH
xylF
chbF
HU
pgm
chbG
appC
chbA
chbB
lysW
mazE
appA
mazF
hemA
metT
metU
tnaA
pheU
araC
tnaB
pheV
hyfR
ilvB
tnaC
proK
ilvN
agaR
ivbL
mpl
fis
proL
proM
rrfA
rrfE
accA
rrlE
deoA
deoB
deoC
prpD
deoD
rrfB
rrfC
accC
ebgR
accD
ampC
sohB
prpC
rrlA
accB
rrfD
nagC
prpB
acnB
prpE
araA
araB
hupB
rrfF
amyA
araD
queA
rrfG
rrfH
rrlB
rpe
ompA
araE
aroK
putP
araH
aroB
speC
araF
gph
ubiG
araJ
xseA
araG
yhfA
rrlG
seqA
rrlH
rrnB
rrnC
rrnD
trpS
cpdB
arsC
arsR
arsB
narG
aldB
rrnG
adhE
rrnH
asnT
asnU
asnV
atpG
cstA
hyaA
proP
rrlD
dam
nanC
trmA
rrlC
damX
thrV
glnQ
alaU
ygjG
alaV
narH
aidB
gltT
narI
narJ
narK
slyA
gltU
atpH
ileU
ileV
atpC
fadD
uspA
treR
aceA
envZ
epd
csiR
gltW
atpD
atpE
ansB
rrnE
bolA
atpA
aspA
fbaA
rrnA
cspD
nanR
atpF
dcuA
gcd
agaA
manX
serT
serX
csiE
atpI
agaV
manY
thrT
bcp
agaW
tpr
gabP
btuB
ebgA
hybO
thrW
csgB
bglX
agaZ
manZ
thrU
csgA
atpB
cedA
ebgC
chpS
gapA
chpB
clpA
mlc
cls
nanA
cmk
nanE
aceF
fadR
nanK
rpsA
corA
creD
rpoE
creC
tyrU
gabT
tyrV
gabD
valT
ygaF
valU
csiD
valX
valY
creB
csrB
cutA
dapA
dcuD
dut
nanT
cydC
ksgA
slmA
efp
narL
icd
tyrT
creA
pdxA
ompF
serA
dadA
cydD
dadX
hybA
hybB
hybC
hybD
fimB
inria-00145537, version 2 - 15 May 2007
fliA
cheA
cheW
motA
motB
tar
trg
aer
flgK
flgN
mglA
flgM
fliD
fliS
fliT
flxA
fliL
fliM
fliN
fliO
fliP
fliQ
ycgR
yhjH
fliC
mglB
mdh
mglC
fliY
fliZ
glpA
glpB
gloA
gltX
tdcR
hybE
hybF
glyV
pdhR
hybG
ulaG
gor
glyX
narV
glyY
narW
cysT
narY
glyW
narZ
tdcA
leuZ
rimM
gmr
rplE
hypD
hemN
rplF
hypE
hepA
rplM
hycA
fdx
hscA
rplN
hycB
rplO
hycC
hscB
rplR
hycD
ileX
rplS
hycE
infC
rplT
hycF
kdsA
kdtA
rplX
hycG
rpmD
hycH
leuU
rpmJ
lysC
rpsE
hycI
mtlA
mntR
menA
rpsI
menB
rpsN
caiF
fixA
fixB
fixC
fixX
asnB
ybaS
gadX
gadB
cysG
gadC
nirB
nirC
caiA
caiB
caiC
menC
rpsP
yfiD
menE
rpsS
aceE
nikA
mtlD
hdeD
lysP
rpsH
mtlR
flhD
tsr
fxsA
metV
secY
nikR
nikB
metW
nikC
speD
tpx
nikD
metZ
speE
focB
nikE
mfd
ddlB
ftsI
mraY
murD
murE
murG
ftsL
mraZ
mraW
murC
murF
ftsW
yebC
trmD
hyfA
ruvC
narX
hyfB
hyfC
hyfD
hyfE
hyfF
hyfG
hyfH
hyfI
hyfJ
ubiA
ubiC
cyoA
hcp
cyoB
nudB
gatB
paaY
gatC
gatD
hcr
cyoC
pcnB
folK
gatY
pgi
gatZ
yhcH
fadA
cyoD
cyoE
tdcC
pheP
fadB
pntA
iclR
pntB
pth
fadI
ychF
fadJ
relA
rfaI
nrdD
rfaJ
rfaP
uxuR
uidA
uidB
nrdG
lpxA
caiE
hdeA
hdeB
hdfR
lrhA
qseB
rplB
lpxD
rfaS
rfaY
rplC
rfaZ
rplD
skp
rseB
fabZ
rfaK
rplP
rfaB
rplV
rfaG
rplW
rnb
sbmA
rnc
rpmC
acpS
sixA
pdxJ
rpsC
recO
apaH
era
rpsJ
yhbH
rpsQ
ptsN
agaB
npr
agaC
yhbJ
agaD
rplQ
agaI
rpoA
apaG
rpsD
agaS
rpsM
agaY
rpsK
yaeL
serU
ybaB
serV
yeaY
argV
yfeY
argY
yiiS
gnd
argQ
cutC
purA
argZ
ecfA
amiA
serW
ecfD
dppB
ssuC
yggN
dppC
ssuE
ecfG
ssuD
ecfH
ssuB
ecfJ
ssuA
ecfK
thrL
thrA
ecfL
thrB
fkpA
dppD
thrC
tufA
dppF
upp
fusA
hemF
uraA
surA
fruA
valV
lpxP
fruB
valW
rseC
fruK
ves
rseA
pckA
xseB
nlpB
pfkA
dxs
imp
ppsA
yajO
ispA
mdoG
pykF
napA
yebB
rihB
ychA
ychQ
yjaB
hflK
hfq
yjeF
mdoH
napB
yjeE
uhpT
fdhF
torA
ndh
nirD
caiD
rfaQ
torC
torD
napC
napD
napF
miaA
uxaC
napG
exuT
napH
hflX
hflC
uxaB
amiB
uxaA
dmsA
mutL
uhpA
glgS
dmsB
hmp
dmsC
ccmA
ccmB
talA
gatA
ccmC
ccmD
fhlA
alkA
ccmE
ccmF
ccmG
ada
alkB
ccmH
alsA
potF
kbl
cydA
exuR
gdhA
alsB
astC
oppA
astD
alsC
astE
alsR
alsE
astA
lsrR
alsI
potG
glnG
potH
potI
astB
nac
argT
oppB
ddpA
oppC
fabA
ddpB
ddpC
oppD
pspF
ddpD
ddpF
oppF
ddpX
hisJ
tdh
hisM
lsrA
hisP
ilvH
ilvI
hisQ
lsrB
ycdG
livF
lsrC
ycdH
livG
rpoN
lsrD
lsrF
lsrG
ycdI
fabB
ycdJ
livH
topA
ycdK
livJ
ycdL
livK
cydB
livM
ycdM
lysU
modB
yeaG
yeaH
sdaA
modC
yhdW
mgtA
yhdX
uidR
yhdY
yhdZ
amtB
glnK
flhC
narP
gltI
lrp
leuA
gadW
fimE
caiT
glpC
modA
stpA
glnH
leuB
nhaA
ydeO
leuC
proV
leuD
nhaR
aroA
glnP
serC
gltJ
dppA
gltL
pspA
deoR
pspB
torR
pspC
gltK
pspD
nrfA
hypF
nrfB
nrfC
nrfD
nrfE
nrfF
RR n 6190
gene cpxR crp cusR cynR cysB cytR dnaA dsdC evgA
variation + + + + + + + + +
gene fucR fur galR gcvA glcC gntR ilvY iscR lexA
variation + + + + + + + + +
gene lysR melR mngR oxyR phoB prpR rbsR rhaR rpoD
variation + + + + + + + + +
gene rpoS soxR soxS srlR trpR tyrR
hypA
zraP
zraR
gadE
moaE
cbpA
moaA
tufB
cfa
moaB
narU
moaC
rsd
moaD
otsB
copA
cueO
fabR
rfaC
rfaL
rfaF
leuO
ilvA
ilvD
ilvE
gadA
ilvG_1
atoE
atoC
atoD
bioF
atoB
bioA
hldD
bioB
wcaB
wza
wzb
bioC
ilvG_2
ilvL
ilvM
xthA
IHF
ihfB
ecpD
htrE
osmY
osmE
rtcA
alpA
slp
kdpE
bioD
atoA
lon
nrfG
variation + + + + + +
zraS
yfhK
otsA
rcsA
able to predict the variation: expressed (+) or repressed ( ) of 381 components of our network (25% of the products of the network). We provide a subset of these predictions in Table 2.
variation + + + + + + + + +
puuP
ihfA
proW
fliR
Table 2: Table of 42 products inferred under stationary phase condition.
pspG
ybhK
cueR
ompC
Figure 4: Global E.coli regulatory network with transcriptional and sigma-factors interactions (3883 interactions and 1529 products). Blue and red interactions represent activation or, respectively, repression. Green and blue nodes correspond to positive and negative observations (40). Red nodes (381) are the total inferred variations of products under stationary growth phase condition.
gene IHF ada agaR alsR araC argP argR baeR cadC
pspE
yaiS
leuL
wcaA
hydN
kch
envY
rpoS
proX
hflB
rtcB
rrmJ
ppiD
ybjP
yehW
yehY
yehZ
fruR
yehX
yggE
rtcR
ibpB
yiaG
ldcC
fic
treA
blc
ftsQ
ansP
clpB
clpX
clpP
dnaJ
dpbA
dnaK
gadY
groS
katE
grpE
msyB
hslV
pfkB
hslU
tktB
htgA
uspB
htpG
modE
htpX
rpoH
htrC
ibpA
phoP
pphA
kdpA
borD
hemL
kdpB
mgrB
nadR
kdpC
pagP
pncB
phoQ
nadB
rstA
zntR
zur
gatR_2
gatR_1
zntA
znuC
znuB
GatR
rstB
slyB
ybjG
yrbL
glk
adiY
adiA
rcsB
RcsB
osmC
wzc
bdm
sra
osmB
ftsA
ftsZ
sdiA
yihI
allA
allR
allB
allS
gcl
allC
allD
ylbA
glxK
dhaR
glxR
hyi
ybbV
ybbW
ybbY
dhaK
dhaL
sdaR
dhaM
garD
garK
garL
garP
garR
gudD
gudP
gudX
rnpB
10
3.5
Guziolowski & al.
Validation of the predicted genes
In order to verify whether the 381 predictions obtained from stationary phase data were valid, we have compared them with three sets of microarray data related to the expression of genes of E.Coli during stationary phase. The result obtained is showed in Table 3. The number of compared genes corresponds to the common genes, the validated genes are those genes which variation in the prediction is the same as in the microarray data set. Table 3: Validation of the prediction with microarray data sets Source of microarray data
inria-00145537, version 2 - 15 May 2007
Gutierrez-Rios and co-workers [6], stationary phase Gene Expression Omnibus ([3],[5]), stationary phase after 20 minutes Gene Expression Omnibus ([3],[5]), stationary phase after 60 minutes
Compared genes 249 292 281
Validated genes (%) 34% 51.71% 51.2%
From the sets of microarray data provided by GEO (Gene Expression Omnibus) for stationary phase measured after 20 and 60 minutes, we have taken into account gene expressions whose absolute value is above a specific threshold and compared only these expression data with the 381 predictions. The percentage of validation obtained for different values of thresholds is illustrated in Figure 5. This percentage increases with the threshold, which is normal because stronger variations are more reliable.
Figure 5: (Left) Percentage of validation of the 381 predicted variations of genes with microarray data sets from GEO (Gene Expression Omnibus) for stationary phase after 20 and 60 minutes. For both experiments we validate the 381 predictions with different sets of microarray observations considering only those genes which absolute value of expression is above certain value (threshold). (Right) Number of genes considered for the validation for the different used thresholds of both microarray data sets.
INRIA
Consistency Between Expression Data and Large Scale Regulatory Networks
11
The percentage of our predictions that does not agree with the microarray results is due to: Erroneous microarray indications for certain genes. The genes xthA, cf a, cpxA, cpxR, gor are predicted as expressed by our model and as repressed by the microarray data [6]. Nevertheless, there is strong evidence that they are expressed during the stationary phase (see [7, 16]). Incompleteness of our network model. Our model predicts that the gene ilvC is expressed, which contradicts microarray data. More careful studies [17] document the decrease of the protein IlvC due to an interaction with clpP which is absent in our model. Indeed, under the introduction of a negative interaction between these species, ilvC is no longer a hard component, which lifts the conflict with data.
inria-00145537, version 2 - 15 May 2007
4
Conclusions
Given an interaction graph of a thousand products, such as E.coli regulatory network, we were able to test its self-consistency and its consistency with respect to observations. We have used mathematical methods first exposed in [9, 12, 14]. We have found that the E.coli transcriptional regulatory network, obtained from RegulonDB site [10],[11] is not self consistent, but can be made self-consistent by adding to it sigma-factors which are transcription initiation factors. The self-consistent network (including sigma-factors) is not consistent with data provided by RegulonDB for the stationary growth phase of E.coli. Sources of inconsistency were mistaken observations. Finally, a step of inference/prediction was achieved being able to infer 381 new variations of products (25% of the total products of the network) from E.coli global network (transcriptional plus sigma-factors interactions). This inference was validated with microarray results, obtaining in the best case that 40% of the inferred variations were consistent (37% were not consistent and 23% of them could not be associated to a microarray measure). We have used our approach to spot several imprecisions in the microarray data and missing interactions in our model. This approach can be used in order to increase the consistency between network models and data, which is important for model refinement. Also, it may serve to increase the reliability of the data sets. We plan to use this approach to test different experimental conditions over E.coli network in order to complete its interaction network model. It should be also interesting to test it with different (signed and oriented) regulatory networks. All the tools provided to arrive to these results were packaged in a Python library called PYQUALI which will be soon publicly available. All scripts and data used in this article are available upon simple request to the authors.
RR n 6190
12
Guziolowski & al.
References [1] T Ali Azam, A Iwata, A Nishimura, S Ueda, and A Ishihama. Growth PhaseDependent Variation in Protein Composition of the Escherichia coli Nucleoid. J Bacteriol, 181(20):6361–6370, August 1991. [2] M Aviv, H Giladi, G Schreiber, AB Oppenheim, and G Glaser. Expression of the genes coding for the Escherichia coli integration host factor are controlled by growth phase, rpoS, ppGpp and by autoregulation. Mol Microbiol, 14(5):1021–1031, Dec 1994. [3] T Barrett, T O Suzek, D B Troup, S E Wilhite, W C Ngau, P Ledoux, D Rudnev, A E Lash, W Fujibuchi, and R Edgar. NCBI GEO: mining millions of expression profiles-database and tools. Nucleic Acids Res, 33:D562–6, Jan 2005.
inria-00145537, version 2 - 15 May 2007
[4] M D Ditto, D Roberts, and R A Weisberg. Growth phase variation of integration host factor level in Escherichia coli. J Bacteriol, 176(12):3738–3748, June 1994. [5] R Edgar, M Domrachev, and A E Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res, 30(1):207–10, Jan 2002. [6] Rosa Maria Gutierrez-Rios, David A Rosenblueth, Jose Antonio Loza, Araceli M Huerta, Jeremy D Glasner, Fred R Blattner, and Julio Collado-Vides. Regulatory network of Escherichia coli: consistency between literature knowledge and microarray profiles. Genome Res, 13(11):2435–2443, Nov 2003. Evaluation Studies. [7] A Ishihama. Functional modulation of Escherichia coli RNA polymerase. Annu Rev Microbiol, 54:499–518, 2000. [8] B Kuipers. Qualitative reasoning. MIT Press, 1994. [9] Ovidiu Radulescu, Sandrine Lagarrigue, Anne Siegel, Philippe Veber, and Michel Le Borgne. Topology and static response of interaction networks in molecular biology. J R Soc Interface, 3(6):185–196, Feb 2006. [10] H Salgado, S Gama-Castro, M Peralta-Gil, E Diaz-Peredo, F Sanchez-Solano, A SantosZavaleta, I Martinez-Flores, V Jimenez-Jacinto, C Bonavides-Martinez, J SeguraSalazar, A Martinez-Antonio, and J Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res, 1(34):D394–7, Jan 2006. [11] H Salgado, A Santos-Zavaleta, S Gama-Castro, M Peralta-Gil, MI Penaloza-Spinola, A Martinez-Antonio, P D Karp, and J Collado-Vides. The comprehensive updated regulatory network of Escherichia coli K-12. BMC Bioinformatics, 7(1):5, Jan 2006.
INRIA
Consistency Between Expression Data and Large Scale Regulatory Networks
13
[12] A Siegel, O Radulescu, M Le Borgne, P Veber, J Ouy, and S Lagarrigue. Qualitative analysis of the relation between DNA microarray data and behavioral models of regulation networks. Biosystems, 84(2):153–174, May 2006. [13] Donna K Slonim. From patterns to pathways: gene expression data analysis comes of age. Nat Genet, 32 Suppl:502–508, Dec 2002. [14] Philippe Veber, Michel Le Borgne, Anne Siegel, Sandrine Lagarrigue, and Ovidiu Radulescu. Complex qualitative models in biology: A new approach. Complexus, 2(34):140–151, 2004. [15] A Weglenska, B Jacob, and A Sirko. Transcriptional pattern of Escherichia coli ihfB (himD) gene expression. Gene, 181(1-2):85–8, Nov 1996.
inria-00145537, version 2 - 15 May 2007
[16] D Weichart, R Lange, N Henneberg, and R Hengge-Aronis. Identification and characterization of stationary phase-inducible genes in Escherichia coli. Mol Microbiol, 10(2):405–20, Oct 1993. [17] D Weichart, Querfurth N, Dreger M, and Hengge-Aronis R. Global role for ClpPcontaining proteases in stationary-phase adaptation of Escherichia coli. J Bacteriol, 185(1):115–125, Jan 2003.
RR n 6190
14
Guziolowski & al.
Contents 1 Introduction
inria-00145537, version 2 - 15 May 2007
2 Mathematical framework 2.1 Introductory example . . . . . . . . . . 2.2 Equilibrium shift of a differential system 2.3 Qualitative constraints . . . . . . . . . . 2.4 Analyzing a network: a simple example
3 . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
3 3 4 4 5
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
3 Results 3.1 Construction of the Escherichia coli regulatory network . . . . . 3.2 Adding sigma factors to obtain self-consistency . . . . . . . . . . 3.3 Compatibility of a network with a set of observations . . . . . . . 3.4 Predictions over a compatible network from a set of observations 3.5 Validation of the predicted genes . . . . . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
6 . 6 . 7 . 7 . 8 . 10
4 Conclusions
. . . .
11
INRIA
inria-00145537, version 2 - 15 May 2007
Unité de recherche INRIA Rennes IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France) Unité de recherche INRIA Futurs : Parc Club Orsay Université - ZAC des Vignes 4, rue Jacques Monod - 91893 ORSAY Cedex (France) Unité de recherche INRIA Lorraine : LORIA, Technopôle de Nancy-Brabois - Campus scientifique 615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex (France) Unité de recherche INRIA Rhône-Alpes : 655, avenue de l’Europe - 38334 Montbonnot Saint-Ismier (France) Unité de recherche INRIA Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France) Unité de recherche INRIA Sophia Antipolis : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)
Éditeur INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)
ISSN 0249-6399