Metabolic network visualization using constraint planar graph drawing algorithm
Romain Bourqui, David Auber
LaBRI, Université Bordeaux I, 351 Cours de la Libération, 33405 Talence CEDEX, France
[email protected],
[email protected]
Vincent Lacroix
BAOBAB Team, Inria Rhône-Alpes, Projet HELIX, Laboratoire de Biométrie et Biologie Évolutive, UMR5558 CNRS, Université Lyon 1, 43 bd 11 nov, 69622 Villeurbanne CEDEX, France
[email protected]
Fabien Jourdan
UMR1089 Xénobiotiques INRA-ENVT, 180 chemin de Tournefeuille, St-Martin-du-Touch, BP 3, 31931 Toulouse CEDEX, France
[email protected]
Abstract
A metabolic network is a set of interconnected metabolic pathways (subnetworks). Until recently, metabolic studies were dedicated to a single pathway, but current research now considers the entire network. As matters stand, existing visualization tools cannot be used to undertake these global studies, since they were designed to probe individual metabolic pathways. To make such studies feasible, this paper presents a graph drawing algorithm for the whole metabolic network. Our collaboration with biologists led us to introduce drawing constraints which take into account the decomposition of the network into metabolic pathways as well as biochemical textbook drawing conventions. These constraints raise numerous graph drawing problems, which are solved by first recursively decomposing the network and then applying suitable graph drawing algorithms. Finally, we present an application that illustrates the advantage of this representation when visualizing groups of reactions which span several metabolic pathways.
1 Introduction
Metabolism corresponds to the set of molecular transformations or energy transfers that occur in the cell or in a living organism. This process is either degradation (catabolism) or synthesis (anabolism) of molecules [17]. A classical way to study metabolism consists in focusing on small subsets of biochemical reactions which are called metabolic pathways. For instance, if a biologist studies the
set of reactions responsible for the conversion of glucose into fat, they will refer to the pathway shown in figure 1. The main limit of this approach is that metabolic pathways are considered as disjoint processes. This assumption clearly does not hold, since most metabolites are shared among several pathways. In a 'Systems Biology' perspective, metabolism is analyzed globally, that is, by integrating all the metabolic pathways into a single network. The benefit of this approach is that it makes it possible to study the crosstalk between pathways and, in particular, facilitates the discovery of alternative pathways. In practice, to achieve these global studies, the network is modeled by a graph and analyzed with graph algorithms (for relevant examples see [6, 15]). During a metabolic study, biologists like to refer to reference pathway representations like the ones shown in figures 1 and 3. These representations present the information in a number of advantageous ways, but their major drawback is that they have been drawn manually and therefore cannot be easily updated. Thus, in addition to the objective of drawing the whole network, a graph drawing algorithm is also motivated both by the regular updates applied to each metabolic network and by the variability between the metabolic networks of different organisms. A general way to create a metabolic network consists in first generating it automatically (according to genetic and bibliographic information) and then validating it manually. Thus, the version of the graph associated with the metabolic network of an organism is very likely to evolve. Moreover,
metabolic studies are usually carried out for several organisms. For instance, the BioCyc Database¹ contains information on 150 different metabolic networks [9]. The main focus is on microorganisms [4, 10], but some are dedicated to plants and mammals [16]². Finally, the cross-species variability can be important. As an example, the whole photosynthesis pathway occurs in plants but not in mammals.
Since metabolic studies have focused on the notion of pathway, existing visualization tools and graph drawing algorithms are mainly dedicated to pathway visualization [3, 7]. As already mentioned, such tools do not make it possible to study cascades of reactions that span several pathways. Moreover, directly applying these algorithms to network visualization is not possible, since they would disregard the hierarchical pathway structure of the network. Therefore, new drawing algorithms are needed. To provide a suitable representation, we propose a graph drawing algorithm that can be applied to the whole network. Previous work [7] and our collaboration with biologists revealed that it is essential to take into account the biochemical textbook drawing conventions (see figure 1). Moreover, since the metabolic pathway information is valuable during a metabolic study (even a global one), this information was used during the computation. That is, a constraint was defined such that nodes belonging to the same pathway have to be drawn close to each other.
This algorithm was implemented in the Tulip graph visualization framework [1], and was used for a specific global analysis task. This task comes from a study by Lacroix et al., who developed an algorithm that identifies the n occurrences (connected small subgraphs) of a given motif (defined as a set of reaction types) in the network [11]. Our method was used to visualize the n occurrences in the whole network, thereby making it possible to formulate biological hypotheses that could not be formulated at the pathway level. We will first describe the metabolic network and how it can be modeled. Then we will present our algorithm and finally show how it was successfully applied to the occurrence visualization task.
Figure 1: Glycolysis and Gluconeogenesis metabolic pathways as they are provided on the KEGG [8] website. This representation is manually drawn, following textbook drawing conventions. Note that the rounded rectangular nodes refer to other pathways, since this pathway is only a small part of the whole metabolic network.
2 Material Modeling the metabolic network consists in choosing which biological entities will be associated with nodes and how edges will represent biochemical reactions. It is necessary to describe this model before introducing the graph drawing algorithm, since it will constrain the representation. For instance, a model may imply that some nodes have a high degree, thus complicating a planarization process.
2.1 Bipartite graph modeling A metabolic network is a set of biochemical reactions, that is, reactions that convert one or more compounds into one or more other compounds. Different models could be used (for a detailed discussion, see [18]). In the one we chose³, there are two kinds of nodes: reactions and compounds (see figure 2), and there is an edge between a reaction and a compound if the compound is consumed or produced by the reaction. This graph is generally called a "bipartite graph" since its set of nodes can be split into two subsets within which no two elements are linked (no link between reactions and no link between compounds). Thus the network is modeled by a graph G = (V, E) such that V = R ∪ S, R ∩ S = ∅ and E ⊆ {(u, v) | u ∈ R, v ∈ S}, where R is the set of reactions and S the set of compounds. A metabolic reaction may be reversible, meaning that it can occur in both directions. In some cases, one direction is clearly preferred (due to thermodynamic constraints)
1 http://biocyc.org/
2 Note that rat (Rattus norvegicus) and mouse (Mus musculus), which are relevant models for biological studies, are still not referenced. When they are described, our method will be able to provide a drawing.
3 The discussion of this choice is out of the scope of this paper. But the main motivation is due to the use of this model in many textbook drawings.
3 Drawing algorithm The aesthetic constraints inherited from textbook drawing have to be taken into account when designing a metabolic network drawing algorithm. In a first section, we will describe these conventions, then we will present the algorithm. Finally we will discuss the various approaches which led us to this method.
3.1 Drawing constraints related to metabolism specificity Figure 1 and figure 3 present two manually drawn representations. A precise analysis of these representations led us to identify the so-called "textbook drawing conventions" (for a detailed description see [3, 7]). It is important to note that the representation choices generally correspond to a biological reality. Cyclic degradation processes, for instance the Krebs cycle, are emphasized by using a circular representation. Reaction cascades are also considered as canonical representation units. Textbook representations also follow aesthetic criteria that are often used by the graph drawing community [2]. The most important one is that the number of edge-edge crossings is as low as possible. Moreover, edge routing is designed to maximize the angle between edges going out of a node (this notion is called the "angular resolution"). These constraints were already used at the pathway level. Since we would like to visualize the whole network while keeping the pathway information, we propose a new constraint: nodes contained in a pathway are placed as close as possible to each other.
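As an illustration of the angular resolution criterion, the following minimal sketch measures the smallest angle between two consecutive edges leaving a node, given 2D coordinates. The function name `angular_resolution` and the coordinate representation are assumptions made for this example; they are not part of the method described in the paper.

```python
import math


def angular_resolution(center, neighbors):
    """Smallest angle (in radians) between two consecutive edges around `center`.

    `center` is an (x, y) pair and `neighbors` a list of (x, y) pairs, one per
    incident edge. Hypothetical helper, for illustration only.
    """
    cx, cy = center
    # Direction of every incident edge, sorted counter-clockwise.
    angles = sorted(math.atan2(y - cy, x - cx) for x, y in neighbors)
    if len(angles) < 2:
        return 2 * math.pi  # a single edge leaves the full circle free
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    # Close the circle: gap between the last and the first edge.
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))
    return min(gaps)


# Four edges leaving a node at right angles: the best possible case for degree 4.
best = angular_resolution((0, 0), [(1, 0), (0, 1), (-1, 0), (0, -1)])
```

A layout that maximizes this quantity at every node spreads the incident edges as evenly as possible around it.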
Figure 2: Bipartite graph describing two biochemical reactions.
but in general it is hard to assess which one is the main direction. Indeed, the information available in databases on this subject is often contradictory. Thus, we chose to consider all reactions as reversible and to model the graph with undirected edges.
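The bipartite model can be sketched with a small data structure. The class name `MetabolicNetwork`, its methods and the two example reactions are assumptions chosen for illustration; the actual implementation relies on the Tulip framework [1], whose API is not shown here.

```python
from collections import defaultdict


class MetabolicNetwork:
    """Undirected bipartite graph: reaction nodes on one side, compounds on the other.

    Illustrative sketch; reaction and compound names are assumed to be disjoint.
    """

    def __init__(self):
        self.reactions = set()        # the set R
        self.compounds = set()        # the set S
        self.adj = defaultdict(set)   # undirected adjacency lists

    def add_reaction(self, reaction, substrates, products):
        # All reactions are treated as reversible, so substrates and products
        # are linked to the reaction node in the same way.
        self.reactions.add(reaction)
        for compound in list(substrates) + list(products):
            self.compounds.add(compound)
            self.adj[reaction].add(compound)
            self.adj[compound].add(reaction)

    def edges(self):
        """Edges as (reaction, compound) pairs."""
        return {(r, c) for r in self.reactions for c in self.adj[r]}

    def is_bipartite(self):
        """True if every edge links a reaction to a compound."""
        return all((u in self.reactions) != (v in self.reactions)
                   for u in self.adj for v in self.adj[u])


# Two illustrative reactions from the beginning of glycolysis.
net = MetabolicNetwork()
net.add_reaction("hexokinase", ["glucose", "ATP"], ["glucose-6-phosphate", "ADP"])
net.add_reaction("phosphoglucose isomerase", ["glucose-6-phosphate"],
                 ["fructose-6-phosphate"])
```

Here glucose-6-phosphate is a single compound node of degree two, linked to both reaction nodes.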
2.2 Metabolic pathways A metabolic pathway is a subset of the metabolic network, that is, a graph P = (Vp, Ep) where Vp ⊂ V and Ep ⊂ E. Thus, for a given metabolic network G, a set of n metabolic pathways is defined as PG = {Pi | 0 < i ≤ n}. Note that two metabolic pathways may share some nodes. In classical representations these shared nodes are duplicated. That is, for each occurrence of a compound (or reaction) in a pathway, a node is created. This solution is not suitable for global analysis since it could lead to erroneous conclusions. For instance, a reaction cascade shared by two pathways will be represented twice, in different areas of the drawing. Thus, users may think that two similar cascades occur in the network while this is not the case. This is the reason why we chose not to duplicate nodes, even if this raises graph drawing problems (see section 3).
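To make the notion of shared nodes concrete, the sketch below represents each pathway by its node set and lists, for every pair of pathways, the nodes they have in common. The function name `shared_nodes` and the toy pathways are hypothetical and only serve as an example.

```python
from itertools import combinations


def shared_nodes(pathways):
    """Map each pair of overlapping pathways to the set of nodes they share.

    `pathways` maps a pathway name to the set of its nodes (reactions and
    compounds). Hypothetical helper, for illustration only.
    """
    overlaps = {}
    for (a, nodes_a), (b, nodes_b) in combinations(sorted(pathways.items()), 2):
        common = nodes_a & nodes_b
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Toy pathways (made-up node names): p1 and p2 share one node,
# p2 and p3 share four nodes, p1 and p3 share none.
toy_pathways = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"d", "e", "f", "g", "h", "i"},
    "p3": {"f", "g", "h", "i", "j"},
}
overlaps = shared_nodes(toy_pathways)
```

Since nodes are not duplicated, every node listed in `overlaps` has to be assigned a single position in the final drawing.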
3.2 Our method The algorithm we use has two main phases: first, a multi-scale clustering is performed, creating a quotient graph⁴, and second, the clusters and the quotient graph are drawn using three drawing algorithms. In the next section, we will first explain our clustering algorithm and then present the drawing algorithms we use. 3.2.1 Multi-scale clustering One of the main problems is that metabolic pathways often share nodes. For instance, see figure 4, where the yellow, blue and green regions respectively represent pathways p1, p2 and p3: one can see an overlap between p1 and p2 (one node) and between p2 and p3 (four nodes). Since we chose not to duplicate nodes and since vertices of a pathway have to be drawn next to each other, our algorithm has to decide whether a shared node will be embedded next to one pathway or next to another. For example, the node shared between p1 and p2 could be drawn near p1 or near p2. This is achieved by a two-step process. The first step
Figure 3: Boehringer poster representing the whole metabolic network.
4 Strictly speaking, the quotient graph is built by considering isolated nodes as singletons.
chosen as our independent set. Let $P_{Nind} = P_G \setminus P_{ind}$. For all the pathways in $P_{Nind}$, we exclude the nodes that are shared with at least one other pathway in $P_G$. We denote this reduced set by $P'_{Nind}$.
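The selection of $P_{ind}$ and the reduction to $P'_{Nind}$ can be sketched as follows. This is a minimal illustration, assuming pathways are given as node sets and that two pathways conflict when they share at least one node; the coloring follows the greedy scheme of Welsh and Powell [19] (vertices sorted by decreasing degree, one color filled at a time). All function names and the toy data are hypothetical.

```python
def welsh_powell(conflicts):
    """Greedy coloring; `conflicts` maps each vertex to the set of its neighbors."""
    order = sorted(conflicts, key=lambda v: (-len(conflicts[v]), v))
    color = {}
    current = 0
    while len(color) < len(order):
        for v in order:
            # Give the current color to every uncolored vertex that has
            # no neighbor already carrying it.
            if v not in color and all(color.get(u) != current for u in conflicts[v]):
                color[v] = current
        current += 1
    return color


def select_independent_pathways(pathways):
    """Return the color class whose pathways have the largest total size."""
    names = sorted(pathways)
    conflicts = {p: {q for q in names if q != p and pathways[p] & pathways[q]}
                 for p in names}
    classes = {}
    for p, c in welsh_powell(conflicts).items():
        classes.setdefault(c, []).append(p)
    best = max(classes.values(), key=lambda cls: sum(len(pathways[p]) for p in cls))
    return sorted(best)


def reduce_non_independent(pathways, p_ind):
    """For pathways outside p_ind, drop every node shared with another pathway."""
    reduced = {}
    for p in pathways:
        if p in p_ind:
            continue
        others = set().union(*(pathways[q] for q in pathways if q != p))
        reduced[p] = pathways[p] - others
    return reduced


# Toy data (made-up node names): p2 overlaps both p1 and p3.
toy = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"d", "e", "f", "g", "h", "i"},
    "p3": {"f", "g", "h", "i", "j"},
}
p_ind = select_independent_pathways(toy)
p_nind_reduced = reduce_non_independent(toy, p_ind)
```

On this toy input the color classes are {p2} (total size 6) and {p1, p3} (total size 9), so {p1, p3} is kept and p2 is reduced to its only unshared node. Because the coloring is a heuristic, the chosen class is not guaranteed to be a maximum-weight independent set in general.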
We cluster the elements of $P_{ind}$ and of $P'_{Nind}$ into metanodes, i.e., we replace each subgraph induced by an element of $P_{ind}$ or $P'_{Nind}$ by a node representing it (see figure 4.b). We call this first graph G1.
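Collapsing clusters into metanodes amounts to building a quotient graph. A minimal sketch, assuming disjoint clusters given as node sets and edges given as pairs; nodes that belong to no cluster are kept as singletons. The function name `quotient_graph` and the toy data are hypothetical.

```python
def quotient_graph(edges, clusters):
    """Collapse each cluster into one metanode and return the quotient edges.

    `edges` is an iterable of (u, v) pairs and `clusters` maps a metanode name
    to the set of nodes it replaces. Clusters are assumed to be disjoint.
    """
    owner = {}
    for name, members in clusters.items():
        for node in members:
            owner[node] = name
    q_edges = set()
    for u, v in edges:
        # Unclustered nodes represent themselves.
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu != cv:  # edges inside a metanode disappear from the quotient
            q_edges.add(tuple(sorted((cu, cv))))
    return q_edges


# Toy chain a-b-c-d-e-f-g-h-x, with three clusters and one free node x.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
             ("e", "f"), ("f", "g"), ("g", "h"), ("h", "x")]
toy_clusters = {"M1": {"a", "b", "c", "d"}, "M2": {"e"}, "M3": {"f", "g", "h"}}
g1_edges = quotient_graph(toy_edges, toy_clusters)
```

Applying the same function again to a graph that already contains metanodes gives the nested, multi-scale structure used in the following steps.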
For all the pathways in $P_{ind}$ and in $P'_{Nind}$, we search for the longest independent cycles⁵ (in figure 4.c, the longest independent cycles of each element of $P_{ind}$ and $P'_{Nind}$ are highlighted in red). These cycles are clustered into metanodes, yielding a multi-scale graph called G2. Second pass : detection of cycles and paths The next step of the algorithm consists in computing the longest independent cycles⁶ in G2, excluding metanodes. At each iteration, we cluster the longest cycle into a metanode and exclude it from the next search. We then compute the longest paths made of nodes of degree less than or equal to two. In figure 4.d, one can see the two new metanodes: the left one is a path and the other one is a cycle. The result of this clustering is the quotient graph that will be the input of the drawing algorithm.
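The path detection step can be approximated by grouping connected nodes of degree at most two. The sketch below is an assumption about how such chains might be collected, not the authors' implementation: it returns each maximal chain as a sorted list of nodes rather than as an ordered path, and the name `degree_two_chains` is hypothetical.

```python
def degree_two_chains(adj, excluded=frozenset()):
    """Group connected nodes of degree <= 2 into maximal chains.

    `adj` maps each node to the set of its neighbors; `excluded` holds nodes
    (e.g. metanodes) that must not be part of a chain. Chains of a single
    node are ignored.
    """
    candidates = {n for n in adj if n not in excluded and len(adj[n]) <= 2}
    seen, chains = set(), []
    for start in sorted(candidates):
        if start in seen:
            continue
        # Depth-first traversal restricted to candidate nodes.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor in candidates and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(component) > 1:
            chains.append(sorted(component))
    return chains


# Toy graph: a hub h of degree 3 with a chain x1-x2-x3 and two leaves y and z.
toy_adj = {
    "h": {"x1", "y", "z"},
    "x1": {"h", "x2"},
    "x2": {"x1", "x3"},
    "x3": {"x2"},
    "y": {"h"},
    "z": {"h"},
}
chains = degree_two_chains(toy_adj)
```

Because every node in a group has at most two neighbors, each group is either a simple path or a simple cycle, which is what makes it suitable for collapsing into a metanode.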
Figure 4: Algorithm overview : (a) a network where each pathway is depicted by a color, (b) clustering according to the overlaps between metabolic pathways, (c) cycle detection in metanodes, (d) cycle and path detection, (e) final representation
3.2.2 Drawing algorithms To draw the metabolic network, we use three drawing algorithms : one for the quotient graph and two for the metanodes. Drawing metanodes To draw subgraphs represented by metanodes, we use a recursive drawing algorithm. This algorithm draws all the subgraphs from the most nested to the least nested. According to our clustering method, a subgraph is either a cycle or an acyclic graph. In the first case, we use a circular drawing algorithm (see figure 5), in the second case, we use the hierarchical drawing algorithm presented in [1].
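For the circular case, a drawing can be obtained by spreading the nodes of the cycle evenly on a circle. The following sketch only illustrates the principle; the function name `circular_layout` and its parameters are assumptions and do not correspond to the Tulip plug-in used by the authors.

```python
import math


def circular_layout(cycle, radius=1.0, center=(0.0, 0.0)):
    """Place the nodes of `cycle` (given in cycle order) evenly on a circle."""
    cx, cy = center
    n = len(cycle)
    return {
        node: (cx + radius * math.cos(2 * math.pi * i / n),
               cy + radius * math.sin(2 * math.pi * i / n))
        for i, node in enumerate(cycle)
    }


# A cycle alternating reactions (r*) and compounds (c*), as in the bipartite model.
positions = circular_layout(["r1", "c1", "r2", "c2"])
```

In a recursive scheme, the bounding box of such a drawing would then be used as the size of the corresponding metanode when the enclosing graph is drawn.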
consists in computing the independent set and the second one in detecting cycles and paths. First pass : computation of the independent set First of all, the algorithm searches for the subset $P_{ind} = \{p_1, ..., p_{ind}\}$, $ind \geq 1$, $P_{ind} \subseteq P_G$, such that (1) the pathways of $P_{ind}$ are independent and (2) $\sum_{i=1}^{ind} |p_i|$ is maximized. For instance, in figure 4.a, {p1, p3} is the independent set that maximizes this sum among all possible independent sets of pathways ({p1}, {p2}, {p3}, {p4}, {p5}, {p1, p3}, {p1, p4}, {p1, p5}, {p2, p4} and {p4, p5}). The problem of finding the maximum independent set is known to be NP-hard, and it can be related to the graph coloring problem, since each color class of a proper coloring is an independent set. To find a solution, we use the Welsh and Powell heuristic [19]. Figure 6.b shows a coloring of pathways obtained with this method. Then, for each color class C, $\sum_{p_i \in C} |p_i|$ is computed, and the maximum one is
Drawing quotient graph We want a drawing that optimizes the angular resolution and the number of bends in order to obtain better readability. The Mixed-Model algorithm of C. Gutwenger and P. Mutzel [5] is a trade-off between all these aesthetic criteria. Moreover,
5 This problem is also NP-complete, but if the computation time exceeds a threshold, we stop the algorithm and take the longest cycle found so far.
6 Cycles C1 and C2 are independent iff C1 and C2 do not share any node.
drawings produced by this algorithm are similar to manually drawn metabolic networks (see figure 1). To use the Mixed-Model algorithm, we need to modify the quotient graph. Indeed, the algorithm can only be applied to planar graphs; therefore, we have to planarize the quotient graph (i.e. make it planar). This problem is well known and is NP-hard [13]. Many techniques exist, either by augmentation or by deletion of edges (or nodes). For a survey on this subject, one can refer to [12]. The drawback of an augmentation-based technique is that it may add up to $|V|^4$ nodes, so that the drawing becomes difficult to understand. That is why we use our own heuristic : vertices of highest degree are removed one by one until the graph becomes planar. All removed nodes are then re-inserted. Removed edges are re-added one by one as long as the graph is planar. The obtained planar subgraph of the quotient graph is drawn by the Mixed-Model algorithm [5]. To summarize, this algorithm has two steps :
Figure 5: E. coli metabolic network using our method. Drawings on the right are zoom in views where cycles are emphasized.
• The first step builds an ordered partition of the set of nodes. This partition is called shelling ordering. The principle is to remove successively nodes that are on the external face of the graph.
constraints. The first one allows the duplication of nodes and the second one does not search for a maximum independent set of pathways. The first technique consists in clustering each pathway into a metanode and in linking metanodes if their corresponding pathways are not independent. With this method we have to make (k-1) duplications of a vertex belonging to k different pathways. In figure 6.b, nodes highlighted in red have been duplicated. We then compute the longest independent cycles within each metanode (see figure 6.c) and the longest paths in the quotient graph. The advantages of this method are that all reactions and substrates of each pathway can be drawn next to each other, and that one can tell from the drawing whether any two pathways are independent just by looking at the edges. However, figure 7 shows that the quotient graph contains a greater number of edges than the one obtained by our algorithm. This implies a large number of edge-edge crossings⁷ and thus makes this representation very hard to understand. The second technique is like our method, except that we do not compute the maximum independent set of pathways (see figure 8.b). The interest of this method is that no vertex is included in a metanode if it also belongs to another pathway. This makes it possible to see, when two pathways share reactions or substrates, which ones they share. The problem with this method is that the resulting drawing may have no pathway entirely drawn in a small region; moreover, the quotient graph has a greater number of vertices and metanodes, and a greater number of edges, than with the method we use, so that the final drawing is more difficult
• The second one is the "recomposition" of the graph according to the shelling ordering. To guarantee that there is neither edge-edge crossing nor node-edge overlap, the ordering is traversed in reverse order.
As described in 3.1, if a vertex is in a pathway, it has to be drawn close to the other vertices of the pathway. Such a constraint can be taken into account in the Mixed-Model algorithm during the decomposition phase. Let SO = {V1, V2, ..., Vr} be the shelling ordering. When a vertex n is added into a set Vi, 1 ≤ i < r, we give priority to adding the vertices that share a constraint with n into the next sets Vj, j > i. Those nodes will then be more likely to be drawn next to each other. The last step of our drawing algorithm is to draw the edges removed during the planarization step. These edges are routed on the external face, using an orthogonal drawing with three bends per edge. Figure 5 shows the drawing obtained by our algorithm on Escherichia coli (E. coli). This organism has been widely studied; its metabolism is composed of 165 pathways, 1062 substrates and reactions (i.e. nodes) and 1263 links (i.e. edges) between them.
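The planarization heuristic used before the Mixed-Model step (remove vertices of highest degree until the graph is planar, re-insert them, then re-add edges one by one) can be sketched as below. Two assumptions are made loudly here. First, the planarity test is passed in as a parameter `is_planar`; a real implementation would plug in a proper planarity test, which is not shown. Second, an edge whose re-insertion breaks planarity is skipped and the remaining edges are still tried, which is one possible reading of the description. The stand-in predicate `euler_bound_ok` only checks the necessary condition |E| ≤ 3|V| − 6 and is not a planarity test; it happens to give the right answers on the K5 example used here.

```python
from itertools import combinations


def planarize(nodes, edges, is_planar):
    """Greedy planar subgraph sketch; returns (kept_edges, removed_edges).

    `is_planar(nodes, edges)` is a caller-supplied predicate (hypothetical
    interface, assumed to accept the empty graph).
    """
    edges = [tuple(e) for e in edges]
    active = set(nodes)

    def induced(node_set):
        return [e for e in edges if e[0] in node_set and e[1] in node_set]

    # Phase 1: drop vertices of highest degree until what remains is planar.
    while not is_planar(active, induced(active)):
        degree = {n: 0 for n in active}
        for u, v in induced(active):
            degree[u] += 1
            degree[v] += 1
        victim = max(sorted(active), key=lambda n: degree[n])
        active.remove(victim)

    # Phase 2: all vertices come back; try the missing edges one by one.
    kept, removed = induced(active), []
    for e in edges:
        if e in kept:
            continue
        if is_planar(set(nodes), kept + [e]):
            kept.append(e)
        else:
            removed.append(e)
    return kept, removed


def euler_bound_ok(nodes, edges):
    """Stand-in predicate: necessary (not sufficient) condition for planarity."""
    return len(nodes) < 3 or len(edges) <= 3 * len(nodes) - 6


# K5, the smallest non-planar complete graph: exactly one edge must be dropped.
k5_nodes = ["a", "b", "c", "d", "e"]
k5_edges = list(combinations(k5_nodes, 2))
kept, removed = planarize(k5_nodes, k5_edges, euler_bound_ok)
```

The edges returned in `removed` correspond to those that are later routed on the external face of the drawing.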
3.3 Discussion In the previous section, we proposed an algorithm which respects all the constraints described in section 3.1. In the following, we discuss two other techniques we have experimented with. In both of them, we have removed one of the
7 using a 2D representation
Figure 7: Dependence graph of E. coli metabolic network, 164 metanodes and 1065 edges
to understand (see figure 9).
Figure 6: From figure a to figure b the decomposition process duplicates nodes shared by two pathways. The drawing process (c to d) corresponds to the one we use.
An enzyme [14] is a protein that catalyzes the conversion of substrates into products. Thus, in the metabolic network, each reaction node can be associated with an enzyme. Our case study is based on the hypothesis that a given sequence of enzymes could occur in different parts of the network. To identify these patterns (motifs), Lacroix et al. provided a search algorithm and applied it to the metabolic network of E. coli⁸ [11]. Their results show that occurrences of a motif may span two or more pathways. For instance, figure 10 shows that for the motif {1.1.1.86, 4.2.1.9, 2.6.1.42, 2.6.1.66}, an occurrence is shared by two metabolic pathways. Figure 10 is made of two screenshots of two views provided by the EcoCyc visualization tool⁹. On these drawings we highlighted an occurrence shared by Valine Biosynthesis and Alanine Biosynthesis. The main limit of this way of visualizing the results is that several views are needed to see the occurrence. This method can be used for a single occurrence, but it becomes very cumbersome when the number of occurrences to examine is large. In the context of an exploratory analysis of the metabolic network, several types of search are performed (exact and approximate search) and several motifs are searched for, leading to a very large number of occurrences to examine.
In contrast to classical approaches, the method we propose makes it possible to visualize each occurrence in a single view even if it spans several metabolic pathways. The view corresponds to the whole network in which all the reactions belonging to the occurrence are highlighted (see left part of figure 11). The occurrence can then be interpreted as a biological process embedded in a larger context, the metabolic network. The way this process is linked to the rest of the network is now explicit. Furthermore, several occurrences may be colored at the same time, enabling the biologist to see, at a glance, the distribution of the occurrences in the network.
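Determining which pathways an occurrence spans reduces to intersecting node sets. In the sketch below, the EC numbers and pathway names come from the case study, but the assignment of each reaction to a pathway is an assumption made for illustration, as are the function name `pathways_spanned` and the use of EC numbers as reaction identifiers.

```python
def pathways_spanned(occurrence, pathways):
    """Return, for each pathway touched by the occurrence, the reactions involved.

    `occurrence` is a set of reaction identifiers and `pathways` maps a pathway
    name to its set of reaction identifiers.
    """
    return {
        name: sorted(occurrence & members)
        for name, members in sorted(pathways.items())
        if occurrence & members
    }


occurrence = {"1.1.1.86", "4.2.1.9", "2.6.1.42", "2.6.1.66"}

# Assumed membership, for illustration only.
example_pathways = {
    "Valine Biosynthesis": {"1.1.1.86", "4.2.1.9", "2.6.1.42"},
    "Alanine Biosynthesis": {"2.6.1.66"},
    "Glycolysis": {"2.7.1.1", "5.3.1.9"},
}
spanned = pathways_spanned(occurrence, example_pathways)
```

In the whole-network view, the nodes of `occurrence` would simply be given a highlight color, independently of the pathways listed in `spanned`.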
4 Conclusion and future work This article presents a graph drawing algorithm dedicated to the visualization of metabolic networks. This method is designed to follow textbook drawing conventions by using suitable graph drawing algorithms. Moreover, the decomposition of the network into metabolic pathways is used as a backbone for the representation. Indeed
8 Escherichia coli (E. Coli) belongs to the family of Enterobacteriaceae, and is present in the lower intestine of humans and warm-blooded animals. E. Coli has been studied intensively by geneticists because of its small genome size, normal lack of pathogenicity, and ease of growth in the laboratory.
9 http://biocyc.org/ECOLI/class-subs-instances?object=Pathways
Figure 9: E. coli metabolic network using a clustering excluding shared vertices
Figure 8: Decomposition and drawing process where shared vertices are excluded (a to b).
Figure 10: Metabolic pathways as they are provided on the Biocyc website (compounds are nodes and reactions are hyperedges between nodes). The occurrence of the motif {1.1.1.86, 4.2.1.9, 2.6.1.42, 2.6.1.66} is composed of four reactions belonging to two different pathways.
this notion is central in biological studies on metabolic networks. The result of a global study on the metabolic network of E. coli is visualized using our method. In contrast with current visualization tools, our overall representation makes it possible to visually detect reaction cascades shared by different metabolic pathways. In the near future, we plan to improve the routing of edges at the end of the quotient graph drawing. Moreover, a dedicated navigation method will be designed, for instance by using "focus plus context" methods.
Figure 11: Visualization of the metabolic network where the occurrences of the motif {1.1.1.86, 4.2.1.9, 2.6.1.42, 2.6.1.66} are highlighted. The right part corresponds to the occurrence shown in figure 10.
Acknowledgments
The authors would like to thank Marie-France Sagot¹⁰ for initiating this collaboration and Ludovic Cottret¹¹ for his work on extracting from BioCyc the data used in this paper.
10 [email protected]
11 [email protected]
References
[1] David Auber. Graph Drawing Software, chapter Tulip - A Huge Graph Visualization Framework. Springer-Verlag, 2003.
[2] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G. Tollis. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall, 1999.
[3] Moritz Becker and Isabel Rojas. A graph layout algorithm for drawing metabolic pathways. Bioinformatics, 17:461–467, 2001.
[4] Ron Caspi, Hartmut Foerster, Carol Fulcher, Rebecca Hopkinson, John Ingraham, Pallavi Kaipa, Markus Krummenacker, Suzanne Paley, John Pick, Seung Y. Rhee, Christophe Tissier, Peifen Zhang, and Peter D. Karp. Metacyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Research, 34:D511–D516, 2006.
[5] C. Gutwenger and P. Mutzel. Planar polyline drawings with good angular resolution. In Graph Drawing '98 (Proc.), Springer-Verlag, Lecture Notes in Computer Science, volume 1547, pages 167–182, 1998.
[6] H. Jeong, B. Tombor, R. Albert, Z. Oltvai, and A. Barabasi. The large-scale organization of metabolic networks. Nature, 407:651, 2000.
[7] Fabien Jourdan and Guy Melançon. A tool for metabolic and regulatory pathways visual analysis. In Visualization and Data Analysis, VDA, pages 46–55, Santa Clara Convention Center, January 2003. SPIE. http://vw.indiana.edu/vda2003/.
[8] Minoru Kanehisa. Post-genome Informatics. Oxford University Press, 2000.
[9] P.D. Karp, C.A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas. Expansion of the biocyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Research, 19:6083–89, 2005.
[10] I.M. Keseler, J. Collado-Vides, S. Gama-Castro, J. Ingraham, S. Paley, I.T. Paulsen, M. Peralta-Gil, and P.D. Karp. Ecocyc: A comprehensive database resource for escherichia coli. Nucleic Acids Research, 33:D334–D337, 2005.
[11] V. Lacroix, C. G. Fernandes, and M.-F. Sagot. Reaction motifs in metabolic networks. In Proceedings of 5th Workshop on Algorithms for BioInformatics (WABI'05), Lecture Notes in BioInformatics, subseries Lecture Notes in Computer Science, volume 3692, pages 178–191, 2005.
[12] Annegret Liebers. Planarizing graphs - a survey and annotated bibliography. Journal of Graph Algorithms and Applications, 5(1):1–74, 2001.
[13] P.C. Lui and R.C. Geldmacher. On the deletion of nonplanar edges of a graph. In Proceeding on the 10th conf. on Comb., Graph Theory, and Comp., pages 727–738, 1977.
[14] Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Enzyme Nomenclature. Academic Press, 1992.
[15] T. Pfeiffer, T. Dandekar, F. Moldenhauer, and S. Schuster. Topological analysis of metabolic networks. application to the metabolism of mycoplasma pneumoniae. In BTK2000: Animating the Cellular Map, pages 229–234, 2000.
[16] P. Romero, J. Wagg, M.L. Green, D. Kaiser, M. Krummenacker, and P.D. Karp. Computational prediction of human metabolic pathways from the complete human genome. Genome Biology, pages 1–17, 2004.
[17] Jack G. Salway. Metabolism at a Glance. Blackwell Science Ltd, 2003.
[18] Jacques van Helden, Lorenz Wernisch, David Gilbert, and Shoshana Wodak. Graph-based analysis of metabolic networks. Ernst Schering Research Foundation Workshop, 38:245–274, 2002.
[19] Welsh and Powell. An upper bound to the chromatic number of a graph and its application to timetabling problems. The Computer journal, 10:85–86, 1967.