Current Proteomics, 2009, Vol. 6, No. 4, 235-245
Proteins as Networks: A Mesoscopic Approach Using Haemoglobin Molecule as Case Study
Alessandro Giuliani*,1, Luisa Di Paola2 and Roberto Setola2
1 Environment and Health Department, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Roma, Italy; 2 Complex System & Security Lab, Università CAMPUS BioMedico, Via A. del Portillo, 21, 00128 Roma, Italy
Abstract: Protein structures lend themselves to a straightforward representation in terms of graph theory, with the amino acid residues as nodes and the spatial contacts between residue pairs as edges. Such a representation makes the vast repertoire of graph invariants developed in the analysis of complex networks directly usable in protein science. In this work we give a general overview of the proteins-as-networks paradigm, with special emphasis on haemoglobin, for which the most important features of protein systems, such as allostery, protein-protein contacts and the differential effect of mutations, were demonstrated to be amenable to a graph-theoretical translation.
1. INTRODUCTION
The prediction of protein structure and function from the primary sequence is a task of capital importance in biochemistry [1, 2]. For this reason, since the 1950s a great number of protein structures have been collected in dedicated databases, and many explanatory models of protein folding have been proposed [3-7]. In recent years, the advent of genomics and proteomics, with its unprecedented amount of data, gave the protein sequence-structure-function problem a further impulse and allowed for a wide-spectrum view that falsified many established principles. Probably the most relevant finding, in terms of its consequences for established paradigms, was the discovery of how widespread natively unfolded proteins, and natively unfolded portions of structured systems, are in eukaryotic cells [8-12]. The discovery of natively unfolded proteins, i.e. protein systems that perform their physiological role without acquiring a well-defined 3D structure, pushed scientists to enlarge the general idea of ‘structure’ from a specific geometrical pattern to a dynamical system moving around a quasi-attractor constituted by an average (but not unique) configuration. This is analogous to the link between the topological representation of a given network (its wiring diagram, corresponding to the crystal structure of the protein) and the dynamics of the network itself, which are supported by (but do not merely coincide with) its topology. Protein systems that undergo many interactions with other polypeptides are particularly rich in natively unfolded tracts, and these natively unfolded patches were found to be involved in both protein-protein interactions and aggregation in many different systems [13, 14]. Recognising the crucial role of dynamics changed the nature of the protein sequence-structure-function problem.
Previously, it was a deterministic task of mapping a given linear arrangement of symbols (or numbers, if expressed in terms of their chemico-physical properties), corresponding to the amino acid residues along the polypeptide chain, into a three-dimensional spatial coordinate system locating each single residue in space. Now we face a mainly stochastic problem of finding the ‘most crucial’ portion of a given structure (sequence) for a given task or property of the system at large. In other words, the sequence/structure prediction task turned into the definition of the mesoscopic principles linking sequence features (like the complexity of the residue arrangement along the chain, or its periodicity) to functional features in which structural and dynamical perspectives are strictly intermingled: interaction/aggregation properties, flexibility, stability, allosteric properties. The historical protein folding paradigm, which assumed that all the information required to determine the three-dimensional native structure of a protein and its folding pathway is contained in the primary structure [15], was enriched with the ‘dynamical flavour’ of the so-called ‘Levinthal paradox’ [16, 17]. The paradox points to the discrepancy between the observed folding kinetics (time scale of a few seconds) and those forecast by imagining a random search for the most stable structural configuration (time scale of many years). Only the addition of another dimension, i.e. a ‘mesoscopic’ viewpoint on the protein system, makes it possible to extract relevant functional hints about the system of interest. In other words, the sequence embeds not only the information about the final 3D structure of the protein but also the ‘procedural information’ needed to reach the native structure in due time. This point was recently addressed with a very rigorous and systematic experimental approach by Niwa et al. [18], who tackled the problem of assessing the ‘intrinsic foldability’ of the E. coli proteome (made of around 3,000 different proteins).
*Address correspondence to this author at the Environment and Health Department, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Roma, Italy.
©2009 Bentham Science Publishers Ltd.
While all these proteins acquire a globular soluble configuration in vivo, only a portion of them was able to do the same in vitro, when isolated from the complex cell environment made of chaperones, molecular crowding, local pH conditions and so forth. The presence of a continuum going from total solubility to aggregation into amorphous matrices when the proteins are observed in vitro tells us that sequence/structure prediction is a well-posed, consistent and autonomous problem only for a minority of protein systems. Nevertheless, some proteins do carry in their sequence all the information needed to go from a linear arrangement of amino acids to a well-defined structure, and the search for the determinants of this process is of utmost theoretical interest. Folding intermediates (foldons) that structure themselves independently of the global system thanks to local amino acid interactions, and so act as promoters in a subsequent phase of large-scale structure formation, were postulated [19-21]. Structural modules have also been recognized to play a capital role in many different features of protein molecules, such as structural stability, enzymatic activity, specific binding sites for regulatory ligands and immunogenic properties; moreover, it was hypothesized that these modules are responsible for the “functional communication” between allosteric protein sites [22-24]. In this work, we explore the possibilities offered by a network-like approach to protein structures. This approach, thanks to its mesoscopic character (i.e. its emphasis on the emergence of global features of the system from the interactions present at a lower level of description of the same system), gives a fresh look at the themes sketched above. In the proposed approach, protein molecules are considered as networks, i.e. as graphs having the residues as nodes and the contacts between residues as edges. This formalization allows for a straightforward use, in protein science, of a set of graph descriptors defined both at the single-node and at the whole-network level. These descriptors were demonstrated to be extremely useful in many different fields of application, ranging from social networks to technological infrastructures (see [25-27] and references cited therein).
We put special emphasis on the possibility, offered by the graph-based formalization, of giving a natural and unbiased definition of the concept of module, thus shedding light on both allosteric effects and protein-protein interactions. We use as a case study the most ‘time-honoured’ protein system, the haemoglobin molecule, so as to clarify some potentially interesting features of this new research avenue.
2. REVIEW OF METHODS
Proteins as Networks: a Natural Perspective
Proteins are polymers arising from the juxtaposition of single amino acid residues along a chain; hence they can be analysed by focusing on the interactions existing among those residues, representing the protein by means of the interaction matrix of its constituent amino acids. Formally, a protein composed of N amino acid residues can be represented by an N×N square matrix having as rows and columns the amino acid residues [28-32] or groups of amino acid residues [33, 34]. When single residues rather than groups are used, some methods also sort the residues according to their location along the sequence. In our case, at the (i, j) location of this matrix we insert the specific relation holding between the two residues, according to the particular level at which the protein is considered. In the analysis of the primary sequence, this relation consists in the presence of a very similar hydrophobicity
between the two residues i and j. This approach was applied in many different instances by means of the so-called Recurrence Quantification Analysis (RQA). In RQA, the interaction matrix is coded as a binary representation named Recurrence Plot (RP), in which a darkened pixel is inserted whenever the two corresponding residues have sufficiently similar values of the studied chemico-physical property [19, 35]. The number and location of these interactions are the basis for the computation of general descriptors of the proteins of interest, allowing the prediction of a number of relevant functional properties of these molecules [36]. If we carefully consider the nature of recurrence plots, we can immediately appreciate that they are fully equivalent to networks in which the single residues (rows and columns of the RP) correspond to the nodes and the darkened pixels (pointing to a relevant similarity between the corresponding residue pair) to the connecting edges. In other words, the RP is nothing more and nothing less than the network adjacency matrix, which is commonly used in graph theory for the generation of network descriptors [23, 37-39]. Moreover, the invariant descriptors computed on these networks (like the amount of recurrence) are whole-protein descriptors arising from a statistical computation executed at the microscopic, single-residue level [35]. This philosophy can be equated to the time-honoured Maxwell-Boltzmann approach, which linked thermodynamic macrostates to microscopically based statistical measures like partition functions. From our viewpoint, the most interesting aspect of this approach is that it relies on the consideration of proteins as graphs. In order to extend this perspective to protein structures seen as graphs, we need to give some general background on graph theory.
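As a minimal sketch of the equivalence between a recurrence plot and an adjacency matrix, the Python fragment below builds a binary RP from a profile of per-residue values and computes the amount of recurrence. It is an illustration under stated assumptions, not the RQA implementation of the cited works: residues are compared one by one (without the embedding windows commonly used in RQA), and the function names, the `radius` threshold and the numerical profile are hypothetical.

```python
def recurrence_plot(values, radius):
    """Binary recurrence plot of a per-residue property profile.

    Entry (i, j) is 1 when residues i and j differ by at most `radius`
    in the chosen property (assumed similarity criterion); the diagonal
    is left at 0 so that the matrix can be read as an adjacency matrix
    without self-loops.
    """
    n = len(values)
    return [[1 if i != j and abs(values[i] - values[j]) <= radius else 0
             for j in range(n)]
            for i in range(n)]


def recurrence_rate(rp):
    """Fraction of off-diagonal pixels that are darkened (amount of recurrence)."""
    n = len(rp)
    return sum(map(sum, rp)) / (n * (n - 1))


# Made-up hydrophobicity-like profile, for illustration only.
profile = [1.8, -4.5, 1.9, -3.5, -4.4]
rp = recurrence_plot(profile, radius=0.2)
print(recurrence_rate(rp))
```

Read as a graph, `rp` links residues 0 and 2 and residues 1 and 4, the only pairs whose values differ by less than the chosen radius.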
Graph Theory
Graph theory gave rise to a set of methodologies for the analysis of complex networks; thanks to an abstract formulation of the underlying problem, it applies to many targets (social networks, metabolic networks, biopolymer aggregates, biopolymer structures). In this way, several mathematical tools can be exploited to efficiently extract relevant features. The theory models any system (seen as a network) in terms of the relations existing among its components, which in a chemical context may be atoms, nucleotides, amino acids, genes or proteins. This leads to representing the network as a graph, i.e. a mathematical object composed of a set V of vertices or nodes (i.e. the components of the network) and a set E of edges between them (i.e. the information about the existence of a relation between two given nodes). The edges may then represent atom-atom chemical bonds, nucleotide-nucleotide hydrogen bonds, amino acid-amino acid electrostatic interactions, gene-gene coexpression, or protein-protein interactions [2, 40, 41]:
$$G = G(V, E)$$
An efficient way to encode a graph is by means of the adjacency matrix A [26]. Specifically, a graph composed of N nodes can be described by a square N×N matrix whose generic element $a_{ij}$ is 1 if there is an edge between the vertices
$v_i$ and $v_j$, and 0 otherwise. Hence, nodes $v_i$ and $v_j$ are said to be adjacent if there exists at least one edge $e_{ij}$ connecting them; the degree $k_i$ of the i-th node is defined as the number of edges connected to the node $v_i$:
$$k_i = \sum_{j=1}^{N} a_{ij}$$
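The degree definition above translates directly into a row sum of the adjacency matrix. The following Python sketch, with hypothetical function names and a toy four-node graph that is not taken from the paper, also tallies how many nodes share each degree.

```python
from collections import Counter


def degrees(adjacency):
    """Degree k_i of every node: the sum of row i of the adjacency matrix."""
    return [sum(row) for row in adjacency]


def degree_distribution(adjacency):
    """Map each degree k to the number of nodes N(k) having that degree."""
    return dict(Counter(degrees(adjacency)))


# Toy undirected graph: node 1 is linked to nodes 0, 2 and 3.
toy = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
print(degrees(toy))
print(degree_distribution(toy))
```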
The type of network is well represented by the degree distribution of its nodes, which gives the number of nodes N(k) having degree k [26]. Networks can also be classified by their inner structural order: regular networks are those whose nodes all have the same degree k, while random networks derive from regular ones when a given fraction p of the edges is replaced through a random process. The parameter p thus spans the whole range from complete order (p = 0, regular network) to complete randomness (p = 1, all the edges are randomly placed). For 0 < p < 1 the corresponding networks show, generally speaking, intermediate features: they are named small-world networks and are well characterised by two topological parameters, the average path length L and the network clustering coefficient C [42]. A key quantity in the graph description is the shortest path, i.e. the minimum number of edges that must be traversed to get from one vertex $v_i$ to another vertex $v_j$. Its average over all possible pairs of vertices is the average path length (also called characteristic path length), while the longest of the shortest paths is the graph diameter. The average path length is defined as follows:
$$L = \frac{1}{N(N-1)} \sum_{i,j \in N,\; i \neq j} l_{ij}$$
where $l_{ij}$ is the length of the shortest path between the nodes $v_i$ and $v_j$; this parameter describes the communication patterns and efficiency of the network and is directly linked to the ability of distant nodes to interact with each other. Strictly connected with the concept of shortest path are the measures of node centrality, which quantify the importance of a node in terms of how many shortest paths pass through it. Formally, it can be evaluated through the betweenness centrality:
$$b_i = \frac{1}{(N-1)(N-2)} \sum_{j,k \in N,\; j \neq k \neq i} \frac{n_{jk}(i)}{n_{jk}}$$
where $n_{jk}$ is the number of shortest paths between nodes j and k, while $n_{jk}(i)$ is the number of those shortest paths that pass through node i. Another quantity useful to characterise a network is the clustering coefficient C, which measures how closely the neighbours of a given node are connected to one another. To define it, let us consider a generic vertex i connected to $k_i$ other vertices; the maximum number of edges that can connect its neighbours to one another is:
$$\frac{k_i (k_i - 1)}{2}$$
The ratio between the number of edges actually present among the neighbours and this maximum is the vertex clustering coefficient $C_i$, which is averaged over all the vertices to give the overall average clustering coefficient:
$$C = \langle C_i \rangle = \frac{1}{N} \sum_{i=1}^{N} \sum_{j<m} \frac{2\, a_{ij} a_{jm} a_{mi}}{k_i (k_i - 1)}$$
where the inner sum runs over the unordered pairs of nodes j, m. These general descriptors can be measured both at the level of the entire graph and for each single node, thus allowing for a characterization of both the entire network structure and its individual nodes [26].
Network Analysis of Protein Structures and the Cartography of Modules
Protein 3D structures can be represented as networks by associating a topological graph with the protein structure information. Starting from the spatial positions contained in the crystallographic data, we take the alpha carbons as the vertices of the graph, and an (undirected) edge between vertices i and j exists if the distance $d_{ij}$ between them lies between 4 Å and 8 Å. This interval was chosen so as to consider the ‘effective’ contacts between residues without the disturbing ‘noise’ due to the trivial spatial proximity of residues adjacent along the sequence. In this way, an isomorphism exists between the three-dimensional protein structure and an undirected graph that contains the information concerning the topology of the molecular structure. The network topology of the protein structure is then represented by the adjacency matrix A, defined as follows:
$$A = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & d_{ij} \in I \\ 0 & d_{ij} \notin I \end{cases}$$
where the distances $d_{ij}$ are in Ångström and I = [4 Å, 8 Å]. This simple formalism was already demonstrated to fully encode the secondary structure of proteins in terms of specific patterns of contacts corresponding to beta sheets and alpha helices [43]. Other authors have used more elaborate schemes, such as shifting and switching functions with a slower decay, to truncate residue-residue contacts [44, 45]. A very important feature of complex networks is the presence of modules, i.e. groups of nodes that have more connections with the other elements of their own module than with nodes belonging to other modules. In order to detect the modules present in the system we used the classical method of spectral graph partitioning [46, 47], based on the analysis of the graph Laplacian matrix, defined as:
$$L = D - A$$
where A is the adjacency matrix defined above and D is the degree matrix, i.e. a diagonal matrix whose non-null elements $d_{ii}$ correspond to the degree $k_i$ of the i-th node. In molecular structure analysis, the eigenvalues of L have been used for the analysis of RNA conformations [48]. The partition of the graph into modules is based on the sign of the elements of the eigenvector associated with the second smallest eigenvalue of the Laplacian matrix, the Fiedler vector: the nodes with positive elements belong to one subgraph, while those with negative elements belong to the other. This procedure is repeated iteratively until all the elements of the resulting eigenvectors have the same sign; in this way, the graph is divided into the maximum number of modules $n_M$
that minimises the number R of edges running between the groups of vertices, called the cut size:
$$R = \frac{1}{2} \sum_{i,j \text{ in different groups}} a_{ij}$$

Table 1. Classification of Nodes Following the Criteria of Guimera-Amaral Cartography

Criteria                      z        P                   Hub
R1: Ultra-peripheral nodes    < 2.5    < 0.05              non-hub
R2: Peripheral nodes          < 2.5    0.05 < P < 0.625    non-hub
R3: Non-hub connectors        < 2.5    0.625 < P < 0.8     non-hub
R4: Non-hub kinless nodes     < 2.5    > 0.8               non-hub
R5: Provincial hubs           > 2.5    < 0.3               hub
R6: Connector hubs            > 2.5    0.3 < P < 0.75      hub
R7: Kinless hubs              > 2.5    > 0.75              hub

The assignment of each residue to a specific module allows for a labelling of each node in terms of its pattern of connections. It separates the nodes that have a strong preference for connecting with other elements inside their own module from the nodes that tend to sit at the ‘frontier’ of their module and thus make a relevant number of connections with elements outside it. Guimera and Amaral defined a sort of network ‘cartography’ based on two parameters that describe the intra- and extra-module connectivity of each node [49-51]. In detail, with reference to the module $S_a$, the within-module degree z-score is defined as:
$$z_i = \frac{k_{iS_a} - \bar{k}_{S_a}}{\sigma_{S_a}}$$
where $k_{iS_a}$ is the number of links that the i-th node has within its own module $S_a$, $\bar{k}_{S_a}$ is the average of this quantity over the nodes of $S_a$, and $\sigma_{S_a}$ is its standard deviation within the module $S_a$. On the other hand, the participation coefficient $P_i$ of the i-th node describes the links that the i-th node establishes with modules other than $S_a$:
$$P_i = 1 - \sum_{s=1}^{n_M} \left( \frac{k_{is}}{k_i} \right)^2$$
where $k_{is}$ is the number of links that the i-th node has with nodes of module s and $k_i$ is its total degree. Based on the values of z and P, each node can be classified in terms of its role within the modular structure: in detail, nodes are classified as hubs or non-hubs according to their z value, while the P parameter defines how their links are distributed among the modules. Table 1 lists the regions of the P-z plane relevant for the cartography definition. Network cartography allowed the Nussinov group to give a topology-based definition of the allosteric effect in terms of ‘shortest path’ [52, 53]. Other groups were able to attach a graph-theory-based definition to other relevant features of protein molecules [54].
3. HAEMOGLOBIN IN PARTICULAR
Haemoglobin is a particularly convenient starting material for putting the above-mentioned ideas into practice. First, haemoglobin is among the most analysed proteins, so it is covered by a rich repertoire of both structural and functional studies. For instance, many details of its activity have been explored, from the kinetic constants, to the complete sequences of hundreds of haemoglobins from a large number of animal species, to the X-ray resolved structures of both physiological and pathological haemoglobins of different species [55-70].
Fig. (1). Three dimensional structure of adult haemoglobin A.
From a structural point of view, haemoglobin (Fig. 1) is a tetrameric protein made of the duplication of a dimer constituted by two slightly different sequences, called alpha and beta subunits. A delicate balance regulates the oxygen affinity of haemoglobin molecules. They must efficiently capture oxygen in oxygen-rich territories (lungs) and efficiently release it in oxygen-poor environments (other tissues), while the opposite must happen for carbon dioxide. This is made possible by the cooperativity among subunits which, by adjusting their relative positions, make haemoglobin shift between a relaxed (R) and a tense (T) configuration, endowed with very different oxygen affinities. This phenomenon is called allostery (from the Greek for ‘different place’), remarking the fact that a change of shape, with a consequent change in affinity, happens at a location well distant from the binding site. This implies that the ‘stimulus’ travels along the structure and exerts its action in a distal portion of the molecule. This kind of behaviour was discovered for the first time in haemoglobin, which still represents the clearest example of biological regulation. A remarkable feature of the model is that we can efficiently follow this complex molecular dynamics phenomenon through the so-called ‘sigmoid’ curve linking the relative oxygen saturation of haemoglobin to the oxygen pressure in the environment. As a matter of fact, the MWC concerted model for the allosteric interactions between ligand binding sites was built using the different haemoglobin conformations as molecular models: the allosteric interactions between the different haemoglobin forms are responsible for the adaptive behaviour of the protein in response to environmental stimuli (Haldane and Bohr effects). The term MWC comes from the initials of Monod, Wyman and Changeux, the scientists who devised and rationalised this model in 1965 [71]. Thus the haemoglobin system has all the basic ingredients we can hope to model with a mesoscopic formalism like the one based on graph theory. Haemoglobin shows the emergence of a collective rearrangement of the system (allosteric effect) due to a local modification (oxygen or carbon dioxide binding). We also find wide interaction regions between different sequences, and huge changes in functionality after relatively small changes in sequence (pathological mutations) going hand in hand with unchanged functionality after much larger differences in the primary structure (inter-species differences, embryonic haemoglobin).
Overall, we have a very complex system with collective behaviours that cannot be explained at the purely microscopic level and that call for an intermediate level of analysis, connecting the microscopic knowledge of the amino acid sequence to the emerging properties of the whole system.
Network Invariants of the Haemoglobin Protein Network
As previously explained, it is possible to establish an isomorphic relationship between the spatial positions of the alpha carbons, as reported in Protein Data Bank (PDB) [72, 73] format files, and a graph representation of the 3D protein structure. The 3D structure maps into the between-residue adjacency matrix defined above which, in spite of containing a compressed amount of structural information (only alpha carbons, yes/no definition of contacts without any quantitative appreciation of the actual distances), is still able to give essential hints about the protein secondary structure [43]. In this study, we try to go further and check whether the contact matrix information not only gives a correct reconstruction of the secondary structure but also allows us to go into tertiary structure details. To this end, we compared the graphs obtained from three different haemoglobin structures:
• Adult physiological haemoglobin α-β (PDB code 1HCO);
• Embryonic Gower form α-ε (PDB code 1A9W);
• Sickle cell haemoglobin α-β’ (PDB code 2HBS).
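The step from alpha-carbon coordinates to the yes/no contact matrix can be sketched in a few lines of Python. This is only an illustration of the 4 Å to 8 Å rule stated in the Methods: the function name and the coordinates are hypothetical (four points on a line, 3.8 Å apart, roughly the spacing of consecutive alpha carbons), and reading the coordinates from a PDB file is left out.

```python
import math


def contact_matrix(coords, d_min=4.0, d_max=8.0):
    """Adjacency matrix of the contact graph built on alpha-carbon positions.

    a[i][j] = 1 when the distance between residues i and j falls inside
    the closed interval [d_min, d_max] (in Angstrom), 0 otherwise.
    """
    n = len(coords)
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if d_min <= math.dist(coords[i], coords[j]) <= d_max:
                a[i][j] = a[j][i] = 1  # undirected edge
    return a


# Synthetic coordinates, not taken from any PDB entry.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_matrix(ca):
    print(row)
```

In this toy chain the sequence neighbours (3.8 Å apart) fall below the lower bound and are not linked, which is the ‘noise’ that the lower threshold is meant to filter out, while residues two positions apart (7.6 Å) are in contact.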
With respect to the mature physiological form, the Gower form contains the so-called ε globin instead of the β one:
this results in an overall difference of 36 amino acids with respect to the mature haemoglobin. The sickle cell haemoglobin differs from the physiological form by just one point mutation in the β chain, namely the glutamic acid at position 6 is substituted by a valine. The paradox (though it is a paradox only from a strictly reductionist viewpoint) is that, while the Gower form is perfectly efficient, this is not the case for the sickle cell form. Trying to solve this paradox is, in our opinion, a good test for the proposed graph-theory-based method.
Going into Tertiary Structure by Pure Contact Information
We demonstrated that the huge compression of information involved in going from the spatial coordinates of all the atoms to the yes/no contact matrix of the alpha carbons alone still allows for the recognition of tertiary and quaternary structural features of haemoglobin. Fig. (2) reports the clustering of the haemoglobin contact matrix by the Newman method [46, 47]. The two colours mark the two-cluster solution, which in turn corresponds to a good approximation to the two globins, i.e. to the actual subunits of the system. This result is remarkable if we concentrate on the interface residues, which were assigned to the correct sequence even though, by the very definition of interface, they are located between the two subunits and are thus very difficult to attribute to the correct subunit by a purely automatic method. In the graph-theoretical formalism each node can be distinguished from the others on the sole basis of its topological location in the network, i.e. only by its pattern of relations with the other nodes [39, 74]. While a node located in the interior of one of the two subunits ‘sees’ only nodes of its own sub-system, an interface node has by definition more or less the same number of connections with both sub-systems.
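A two-cluster solution of this kind rests on the sign pattern of the Fiedler vector of the Laplacian L = D - A. The Python sketch below shows a single bisection step on a toy graph; it is not the implementation used for Fig. (2). To stay free of external libraries it approximates the Fiedler vector by shifted power iteration, a technique swapped in here for illustration (a linear-algebra eigen-solver would normally be used), and all names, the iteration count and the toy graph are assumptions.

```python
def fiedler_vector(adjacency, iterations=500):
    """Approximate the Fiedler vector of L = D - A by shifted power iteration.

    Assumes a connected, undirected graph with at least two nodes and a
    start vector that is not orthogonal to the Fiedler vector.
    """
    n = len(adjacency)
    deg = [sum(row) for row in adjacency]
    shift = 2 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    # Deterministic start vector with zero mean (orthogonal to the constant vector).
    x = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iterations):
        # y = (shift * I - L) x, with L = D - A
        y = [(shift - deg[i]) * x[i]
             + sum(adjacency[i][j] * x[j] for j in range(n))
             for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]  # remove the trivial constant eigenvector
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


def bisect(adjacency):
    """Split the nodes into two modules according to the sign of the Fiedler vector."""
    return [0 if v >= 0 else 1 for v in fiedler_vector(adjacency)]


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy = [[0] * 6 for _ in range(6)]
for i, j in edges:
    toy[i][j] = toy[j][i] = 1
print(bisect(toy))
```

Applying `bisect` again to each resulting subgraph would mimic the iterative procedure described in the Methods; the stopping rule is not sketched here.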
Nevertheless, the module identification algorithm was able to correctly assign all the interface nodes (residues) to their subunit. This implies the presence of a general ‘signature’ or ‘global pattern’ encompassing the specific subunit, which is in some way encoded in the topological location of each amino acid residue, so that each residue is recognized as a member of its system.
Fig. (2). Adult haemoglobin structure representation as a network: the two coloured modules were detected through a graph partitioning algorithm and identify with good precision the two globins α and β.
Having proven the ability of the graph-theoretical formalism to correctly identify subunits (tertiary feature) and the nature of the interface residues (quaternary feature), let us analyse the sensitivity of the graph theory approach in identifying the topological counterpart of the allosteric effect. Figs. (3-6) show the network properties of the adult physiological haemoglobin. The node degree follows a Poisson-like distribution: it seems that, as in the small-world architecture, there are no highly connected nodes (hubs) and that most nodes have the same degree (the value at the distribution maximum). This could be the effect of the ‘globularity’ of the studied molecule, pointing to a generally symmetric shape of the system. Further, the node betweenness [26] distribution shows a clear degree of symmetry between the two chains (the second chain starts at the 148th node). This is expected, since the structures of the two globins are known to be quite similar, corresponding to similar functionalities. This is another proof of concept of the faithful conservation of structural properties by the topological network approach.
Allosteric and Gower/Sickle Cell Anaemia Paradox
Fig. (3). Adult physiological haemoglobin network parameters: node betweenness distribution.
As for the mature haemoglobin network, its longest shortest path (the diameter) has a very low value (11), which supports the allosteric feature of the molecule. Indeed, as previously explained, it has been shown that the allosteric behaviour proceeds through structural “shortcuts”, which can be interpreted in terms of graph shortest paths connecting two nodes that correspond to residues quite far apart in the three-dimensional protein structure. Analogous results were found for the Gower form (Figs. 7-10), whose parameter distributions are almost indistinguishable from those of the physiological adult form: this is an interesting result, since the number of residues that change with respect to the adult physiological form is quite high, yet it does not produce major structural variations. On the other hand, some relevant differences emerge when the network parameters of the sickle cell haemoglobin are analysed (see Figs. 11-14): as a matter of fact, the clustering coefficient distribution is more spread out, and this is also reflected in the node degree distribution, which shows a broader peak than in the other structural forms. The results, in terms of higher-grade nodes, are summed up in Table 2. The small-world character of the protein structure is evident for all three forms, as shown by the low value of the mean shortest-path length. The sickle cell degree distribution is in striking contrast with that of the mature physiological form, and the fact that this global change derives from the change of only one residue is very promising as to the sensitivity of the network approach in explaining the non-linear global consequences of mutations.
Fig. (4). Adult physiological haemoglobin network parameters: node clustering coefficient distribution.
Fig. (5). Adult physiological haemoglobin network parameters: node degree distribution.
Fig. (6). Adult physiological haemoglobin network parameters: shortest path lengths distribution.
Fig. (7). Gower haemoglobin network parameters: node betweenness distribution.
Fig. (8). Gower haemoglobin network parameters: node clustering coefficient distribution.
Fig. (9). Gower haemoglobin network parameters: node degree distribution.
Fig. (10). Gower haemoglobin network parameters: shortest path lengths distribution.
Fig. (11). Sickle cell haemoglobin network parameters: node betweenness distribution.
Fig. (12). Sickle cell haemoglobin network parameters: node clustering coefficient distribution.
Finally, as for the Guimera-Amaral cartography [49], for all three structures the P vs. z plot has the expected “dentist-chair” shape, which has already been observed for a large number of protein structures. This highly invariant shape was observed for the adjacency matrices of very different proteins, ranging from fibrous to globular shapes [19, 23, 75], and is suggestive of a sort of ‘protein-like’ general behaviour of graphs that could point to still unknown energetic principles governing protein folding (see Figs. 15-17).
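The two coordinates of the P vs. z plot can be computed from an adjacency matrix once every node carries a module label. The Python sketch below follows the within-module z-score and participation coefficient recalled in the Methods; the function name, the use of the population standard deviation, the convention z = 0 for modules with no degree spread, and the toy graph are all assumptions made for illustration.

```python
def cartography(adjacency, modules):
    """Within-module degree z-score and participation coefficient of each node.

    `modules[i]` is the module label of node i. Returns two lists (z, p).
    """
    n = len(adjacency)
    labels = sorted(set(modules))
    # links[i][s]: number of links of node i towards nodes of module s
    links = [{s: 0 for s in labels} for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adjacency[i][j]:
                links[i][modules[j]] += 1
    within = [links[i][modules[i]] for i in range(n)]

    z, p = [], []
    for i in range(n):
        members = [within[m] for m in range(n) if modules[m] == modules[i]]
        mean = sum(members) / len(members)
        std = (sum((v - mean) ** 2 for v in members) / len(members)) ** 0.5
        z.append((within[i] - mean) / std if std > 0 else 0.0)
        k = sum(links[i].values())  # total degree of node i
        p.append(1.0 - sum((c / k) ** 2 for c in links[i].values()) if k > 0 else 0.0)
    return z, p


# Toy graph: a star (centre 0, leaves 1-3) forms module 0, the pair 4-5
# forms module 1, and the edge 3-4 bridges the two modules.
edges = [(0, 1), (0, 2), (0, 3), (4, 5), (3, 4)]
toy = [[0] * 6 for _ in range(6)]
for i, j in edges:
    toy[i][j] = toy[j][i] = 1
z, p = cartography(toy, [0, 0, 0, 0, 1, 1])
print(z)
print(p)
```

In the toy graph the star centre gets the highest z within its module, while the two bridge nodes 3 and 4, whose links are split evenly between the modules, get P = 0.5; comparing such values with the thresholds of Table 1 would assign each node to one of the regions R1 to R7.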
Fig. (13). Sickle cell haemoglobin network parameters: node degree distribution.
Fig. (15). Adult physiological haemoglobin network partitioning parameters.
Fig. (14). Sickle cell haemoglobin network parameters: shortest path lengths distribution.

Table 2. Network Parameters for the Three Analysed Haemoglobin Structures: Higher-Grade Nodes are Evidenced for the Single Node Properties

Parameter                 Adult Physiological Hb    Gower Hb              Sickle Cell Hb
Betweenness               108, 103, 35, 113         108, 103, 35, 113     108, 103, 35, 113
Diameter                  11                        11                    11
Degree                    14 (70)                   13 (59)               14 (74)
Mean shortest path        5.14                      5.17                  5.16
Closeness                 108, 103, 113, 106        108, 103, 113, 106    108, 103, 113, 106
Clustering coefficient    73                        73                    73
Efficiency                0.24                      0.24                  0.24
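As a reading aid for the closeness and efficiency rows of Table 2, the following sketch computes both quantities from shortest path lengths. This section does not restate the formulas behind the table, so the definitions used here are assumptions: closeness as (n - 1) divided by the sum of distances from a node, and efficiency as the average of 1/d over all node pairs (the usual global efficiency). The toy graph and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Edge-count distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(graph, node):
    """Assumed definition: (n - 1) / sum of distances to all other nodes."""
    dist = bfs_distances(graph, node)
    total = sum(d for target, d in dist.items() if target != node)
    return (len(graph) - 1) / total if total else 0.0


def global_efficiency(graph):
    """Assumed definition: mean of 1/d(i, j) over all ordered node pairs.

    Unreachable pairs contribute 0, as they never appear in the BFS result.
    """
    n = len(graph)
    if n < 2:
        return 0.0
    total = 0.0
    for source in graph:
        dist = bfs_distances(graph, source)
        total += sum(1.0 / d for target, d in dist.items() if target != source)
    return total / (n * (n - 1))


# Hypothetical toy network: chain 1-2-3-4-5-6 plus one long-range contact 1-5.
toy_net = {
    1: {2, 5}, 2: {1, 3}, 3: {2, 4},
    4: {3, 5}, 5: {4, 6, 1}, 6: {5},
}
toy_efficiency = global_efficiency(toy_net)
toy_closeness_5 = closeness(toy_net, 5)
```

Because efficiency averages inverse distances, it is dominated by short paths, which is why it can be larger than the reciprocal of the mean shortest path length.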
Fig. (16). Gower haemoglobin network partitioning parameters.
Fig. (17). Sickle cell haemoglobin network partitioning parameters.
CONCLUSIONS
The above analysis gave very promising, albeit still preliminary, results. First of all, the possibility of obtaining a faithful representation of the three-dimensional global structure of haemoglobin from the simple contact matrix indicates that the network representation retains the essentials of the protein structural information. This in turn makes it possible to rationalize some well-known physiological properties of haemoglobin, such as the allosteric effect, which is very difficult to appreciate in structural terms but obtains a direct quantification in terms of the shortest path descriptor. This can be very helpful in establishing sequence/structure relationships even in very ‘difficult’ settings such as the sickle cell anaemia case. On more theoretical grounds, the invariance of the ‘dentist chair’ pattern in the Guimera-Amaral space, with all proteins giving rise to the same node distribution irrespective of their actual shape, suggests the presence of still unknown energetic terms driving protein folding.
All in all, the richness of information that can be derived from a relatively simple and natural formalization of protein structural information suggests the 'protein as network' paradigm as a useful mesoscopic model of proteins. Future work is needed to proceed along this avenue, and many improvements can easily be envisaged, first of all labelling the nodes with the physico-chemical properties of the corresponding residues. On methodological grounds, it is of utmost importance to stress the great promise held by the cross-fertilization of different fields of expertise, in this case the biochemistry and engineering/applied mathematics cultures.
ACKNOWLEDGEMENTS
The authors want to thank Ylichron for providing computational resources by means of the online tool NAT (Network Analysis Tool - http://irriis.nat.ylichron.it/).
REFERENCES
[1] Chou, K.C. and Shen, H.B. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc., 2008, 3, 153-162.
[2] González-Díaz, H.; González-Díaz, Y.; Santana, L.; Ubeira, F.M. and Uriarte, E. Proteomics, networks and connectivity indices. Proteomics, 2008, 8, 750-778.
[3] Chen, Y.; Ding, F.; Nie, H.; Serohijos, A.W.; Sharma, S.; Wilcox, K.C.; Yin, S. and Dokholyan, N.V. Protein folding: then and now. Arch. Biochem. Biophys., 2008, 469, 4-19.
[4] Maity, H.; Maity, M.; Krishna, M.M.; Mayne, L. and Englander, S.W. Protein folding: the stepwise assembly of foldon units. Proc. Natl. Acad. Sci. USA, 2005, 102, 4741-4746.
[5] Creighton, T.E. Protein folding. Biochem. J., 1990, 270, 1-16.
[6] Yon, J.M. Protein folding in the post-genomic era. J. Cell Mol. Med., 2002, 6, 307-327.
[7] Leopold, P.E.; Montal, M. and Onuchic, J.N. Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc. Natl. Acad. Sci. USA, 1992, 89, 8721-8725.
[8] Dunker, A.K.; Brown, C.J. and Obradovic, Z. Identification and functions of usefully disordered proteins. Adv. Protein Chem., 2002, 62, 25-49.
[9] Dunker, A.K.; Cortese, M.S.; Romero, P.; Iakoucheva, L.M. and Uversky, V.N. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J., 2005, 272, 5129-5148.
[10] Dunker, A.K.; Lawson, J.D.; Brown, C.J.; Williams, R.M.; Romero, P.; Oh, J.S.; Oldfield, C.J.; Campen, A.M.; Ratliff, C.M.; Hipps, K.W.; Ausio, J.; Nissen, M.S.; Reeves, R.; Kang, C.; Kissinger, C.R.; Bailey, R.W.; Griswold, M.D.; Chiu, W.; Garner, E.C. and Obradovic, Z. Intrinsically disordered protein. J. Mol. Graph. Model., 2001, 19, 26-59.
[11] Uversky, V.N. Natively unfolded proteins: a point where biology waits for physics. Protein Sci., 2002, 11, 739-756.
[12] Fink, A.L. Natively unfolded proteins. Curr. Opin. Struct. Biol., 2005, 15, 35-41.
[13] Uversky, V.N.; Segel, D.J.; Doniach, S. and Fink, A.L. Association-induced folding of globular proteins. Proc. Natl. Acad. Sci. USA, 1998, 95, 5480-5483.
[14] Zbilut, J.P.; Colosimo, A.; Conti, F.; Colafranceschi, M.; Manetti, C.; Valerio, M.; Webber, C.L.Jr. and Giuliani, A. Protein aggregation/folding: the role of deterministic singularities of sequence hydrophobicity as determined by nonlinear signal analysis of acylphosphatase and Abeta(1-40). Biophys. J., 2003, 85, 3544-3557.
[15] Anfinsen, C.B. The formation and stabilization of protein structure. Biochem. J., 1972, 128, 737-749.
[16] Levinthal, C. Are there pathways for protein folding. J. Chim. Phys. Phys. Chim Biol., 1968, 65, 44-45.
[17] Zwanzig, R.; Szabo, A. and Bagchi, B. Levinthal's paradox. Proc. Natl. Acad. Sci. USA, 1992, 89, 20-22.
[18] Niwa, T.; Ying, B.W.; Saito, K.; Jin, W.; Takada, S.; Ueda, T. and Taguchi, H. Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc. Natl. Acad. Sci. USA, 2009, 106, 4201-4206.
[19] Krishnan, A.; Zbilut, J.P.; Tomita, M. and Giuliani, A. Proteins as networks: usefulness of graph theory in protein science. Curr. Protein Pept. Sci., 2008, 9, 28-38.
[20] Panchenko, A.R.; Luthey-Schulten, Z.; Cole, R. and Wolynes, P.G. The foldon universe: a survey of structural similarity and self-recognition of independently folding units. J. Mol. Biol., 1997, 272, 95-105.
[21] Englander, S.W.; Mayne, L. and Rumbley, J.N. Submolecular cooperativity produces multi-state protein unfolding and refolding. Biophys. Chem., 2002, 101-102, 57-65.
[22] Mete, M.; Tang, F.; Xu, X. and Yuruk, N. A structural approach for finding functional modules from large biological networks. BMC Bioinformatics, 2008, 9 Suppl, S19.
[23] Krishnan, A.; Giuliani, A.; Zbilut, J.P. and Tomita, M. Network scaling invariants help to elucidate basic topological principles of proteins. J. Proteome Res., 2007, 6, 3924-3934.
[24] Wang, Z. and Zhang, J. In search of the biological significance of modular structures in protein networks. PLoS Comput. Biol., 2007, 3, e107.
[25] Barabasi, A.L. and Oltvai, Z.N. Network biology: understanding the cell's functional organization. Nat. Rev. Genet., 2004, 5, 101-113.
[26] Bornholdt, S. and Schuster, H.G. Handbook of Graphs and Complex Networks: From the Genome to the Internet, Wiley-VCH GmbH & Co. KGaA: Weinheim, 2003.
[27] Albert, R. and Barabasi, A.L. Dynamics of complex systems: scaling laws for the period of boolean networks. Phys. Rev. Lett., 2000, 84, 5660-5663.
[28] González-Díaz, H.; Saiz-Urra, L.; Molina, R.; Santana, L. and Uriarte, E. A model for the recognition of protein kinases based on the entropy of 3D van der waals interactions. J. Proteome Res., 2007, 6, 904-908.
[29] Gonzalez-Diaz, H.; Saiz-Urra, L.; Molina, R.; Gonzalez-Diaz, Y. and Sanchez-Gonzalez, A. Computational chemistry approach to protein kinase recognition using 3D stochastic van der Waals spectral moments. J. Comput. Chem., 2007, 28, 1042-1048.
[30] González-Díaz, H.; Pérez-Castillo, Y.; Podda, G. and Uriarte, E. Computational chemistry comparison of stable/nonstable protein mutants classification models based on 3D and topological indices. J. Comput. Chem., 2007, 28, 1990-1995.
[31] Concu, R.; Podda, G.; Uriarte, E. and Gonzalez-Diaz, H. Computational chemistry study of 3D-structure-function relationships for enzymes based on Markov models for protein electrostatic, HINT, and van der Waals potentials. J. Comput. Chem., 2009, 30, 1510-1520.
[32] Concu, R.; Dea-Ayuela, M.A.; Perez-Montoto, L.G.; Bolas-Fernandez, F.; Prado-Prado, F.J.; Podda, G.; Uriarte, E.; Ubeira, F.M. and Gonzalez-Diaz, H. Prediction of enzyme classes from 3D structure: A general model and examples of experimental-theoretic scoring of peptide mass fingerprints of leishmania proteins. J. Proteome Res., 2009, 8, 4372-4382.
[33] Aguero-Chapin, G.; Varona-Santos, J.; de la Riva, G.A.; Antunes, A.; Gonzalez-Villa, T.; Uriarte, E. and Gonzalez-Diaz, H. Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from coffea arabica and prediction of a new sequence. J. Proteome Res., 2009, 8, 2122-2128.
[34] Agüero-Chapin, G.; Gonzalez-Diaz, H.; Molina, R.; Varona-Santos, J.; Uriarte, E. and Gonzalez-Diaz, Y. Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett., 2006, 580, 723-730.
[35] Giuliani, A.; Benigni, R.; Zbilut, J.P.; Webber, C.L.Jr.; Sirabella, P. and Colosimo, A. Nonlinear signal analysis methods in the elucidation of protein sequence-structure relationships. Chem. Rev., 2002, 102, 1471-1492.
[36] Giuliani, A.; Sirabella, P.; Benigni, R. and Colosimo, A. Mapping protein sequence spaces by recurrence quantification analysis: a case study on chimeric structures. Protein Eng., 2000, 13, 671-678.
[37] Marrero-Ponce, Y.; Medina-Marrero, R.; Castillo-Garit, J.A.; Romero-Zaldivar, V.; Torrens, F. and Castro, E.A. Protein linear indices of the 'macromolecular pseudograph alpha-carbon atom adjacency matrix' in bioinformatics. Part 1: prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor. Bioorg. Med. Chem., 2005, 13, 3003-3015.
[38] Marrero-Ponce, Y.; Medina-Marrero, R.; Castro, A.E.; Ramos de Armas, R.; González-Díaz, H.; Romero-Zaldivar, V. and Torrens, F. Protein quadratic indices of the “Macromolecular Pseudograph’s α-Carbon Atom Adjacency Matrix”. 1. Prediction of Arc repressor alanine-mutant’s stability. Molecules, 2004, 9, 1124-1147.
[39] Estrada, E. and Rodriguez-Velazquez, J.A. Subgraph centrality in complex networks. Phys. Rev. E., 2005, 71, 056103.
[40] González-Díaz, H.; Vilar, S.; Santana, L. and Uriarte, E. Medicinal chemistry and bioinformatics – current trends in drugs discovery with networks topological indices. Curr. Top. Med. Chem., 2007, 7, 1025-1039.
[41] Gonzalez-Diaz, H.; Prado-Prado, F. and Ubeira, F.M. Predicting antimicrobial drugs and targets with the MARCH-INSIDE approach. Curr. Top. Med. Chem., 2008, 8, 1676-1690.
[42] Watts, D.J. and Strogatz, S.H. Collective dynamics of 'small-world' networks. Nature, 1998, 393, 440-442.
[43] Webber, C.L.Jr.; Giuliani, A.; Zbilut, J.P. and Colosimo, A. Elucidating protein secondary structures using alpha-carbon recurrence quantifications. Proteins, 2001, 44, 292-303.
[44] Saiz-Urra, L.; González-Díaz, H. and Uriarte, E. Proteins Markovian 3D-QSAR with spherically-truncated average electrostatic potentials. Bioorg. Med. Chem., 2005, 13, 3641-3647.
[45] González-Díaz, H.; Saíz-Urra, L.; Molina, R. and Uriarte, E. Stochastic molecular descriptors for polymers. 2. Spherical truncation of electrostatic interactions on entropy based polymers 3D-QSAR. Polymers, 2005, 46, 2791-2798.
[46] Newman, M.E. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 2006, 74, 036104.
[47] Newman, M.E. and Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 2004, 69, 026113.
[48] Barash, D. Second eigenvalue of the Laplacian matrix for predicting RNA conformational switch by mutation. Bioinformatics, 2004, 20, 1861-1869.
[49] Guimera, R. and Amaral, L.A. Cartography of complex networks: modules and universal roles. J. Stat. Mech., 2005, 2005, nihpa35573.
[50] Guimera, R.; Sales-Pardo, M. and Amaral, L.A. Module identification in bipartite and directed networks. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 2007, 76, 036102.
[51] Guimera, R.; Sales-Pardo, M. and Amaral, L.A. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 2004, 70, 025101.
[52] Del Sol, A.; Arauzo-Bravo, M.J.; Amoros, D. and Nussinov, R. Modular architecture of protein structures and allosteric communications: potential implications for signaling proteins and regulatory linkages. Genome Biol., 2007, 8, R92.
[53] del Sol, A.; Fujihashi, H.; Amoros, D. and Nussinov, R. Residues crucial for maintaining short paths in network communication mediate signaling in proteins. Mol. Syst. Biol., 2006, 2, 2006.0019.
[54] Vendruscolo, M.; Dokholyan, N.V.; Paci, E. and Karplus, M. Small-world view of the amino acids that play a key role in protein folding. Phys. Rev. E. Stat. Nonlin. Soft Matter Phys., 2002, 65, 061910.
[55] Sathya Moorthy, P.; Neelagandan, K.; Balasubramanian, M. and Ponnuswamy, M.N. Purification, crystallization and preliminary X-ray diffraction studies on avian haemoglobin from pigeon (Columba livia). Acta Crystallogr. Sect. F. Struct. Biol. Cryst. Commun., 2009, 65, 120-122.
[56] Fonseca, J.C.; Honda, R.T.; Delatorre, P.; Fadel, V.; Bonilla-Rodriguez, G.O. and de Azevedo, W.F.Jr. Crystallization, preliminary X-ray analysis and molecular-replacement solution of haemoglobin-II from the fish matrinxa (Brycon cephalus). Acta Crystallogr. D. Biol. Crystallogr., 2003, 59, 752-754.
[57] Pesce, A.; Nardini, M.; Dewilde, S.; Ascenzi, P.; Riggs, A.F.; Yamauchi, K.; Geuens, E.; Moens, L. and Bolognesi, M. Crystallization and preliminary X-ray analysis of neural haemoglobin from the nemertean worm Cerebratulus lacteus. Acta Crystallogr. D. Biol. Crystallogr., 2001, 57, 1897-1899.
[58] Honda, R.T.; Delatorre, P.; Fadel, V.; Canduri, F.; Dellamano, M.; de Azevedo, W.F.Jr. and Bonilla-Rodriguez, G.O. Crystallization, preliminary X-ray analysis and molecular-replacement solution of the carboxy form of haemoglobin I from the fish Brycon cephalus. Acta Crystallogr. D. Biol. Crystallogr., 2000, 56, 1685-1687.
[59] Deepthi, S.; Johnson, A.; Sathish, R. and Pattabhi, V. Purification, crystallisation and preliminary X-ray study of haemoglobin from Crocodilis palustris and Crocodilis porosus. Biochim. Biophys. Acta, 2000, 1480, 384-387.
[60] Smarra, A.L.; de Azevedo, W.F.Jr.; Fadel, V.; Delatorre, P.; Dellamano, M.; Colombo, M.F. and Bonilla-Rodriguez, G.O. Purification, crystallization and preliminary X-ray analysis of haemoglobin I from the armoured catfish Liposarcus anisitsi. Acta Crystallogr. D. Biol. Crystallogr., 2000, 56, 495-497.
[61] Luisi, B.F.; Nagai, K. and Perutz, M.F. X-ray crystallographic and functional studies of human haemoglobin mutants produced in Escherichia coli. Acta Haematol., 1987, 78, 85-89.
[62] Eisenberger, P.; Shulman, R.G.; Kincaid, B.M.; Brown, G.S. and Ogawa, S. Extended X-ray absorption fine structure determination of iron nitrogen distances in haemoglobin. Nature, 1978, 274, 30-34.
[63] Finean, J.B.; Freeman, R. and Coleman, R. X-ray diffraction patterns from haemoglobin-free erythrocyte membranes. Nature, 1975, 257, 718-719.
[64] Magdoff-Fairchild, B.; Swerdlow, P.H. and Bertles, J.F. Intermolecular organization of deoxygenated sickle haemoglobin determined by x-ray diffraction. Nature, 1972, 239, 217-219.
[65] Perutz, M.F.; Bolton, W.; Diamond, R.; Muirhead, H. and Watson, H.C. Structure of haemoglobin. An X-Ray examination of reduced Horse haemoglobin. Nature, 1964, 203, 687-690.
[66] Perutz, M.F. and Mazzarella, L. A preliminary X-Ray analysis of haemoglobin H. Nature, 1963, 199, 639.
[67] Perutz, M.F.; Rossmann, M.G.; Cullis, A.F.; Muirhead, H.; Will, G. and North, A.C. Structure of haemoglobin: a three-dimensional Fourier synthesis at 5.5-Å resolution, obtained by X-ray analysis. Nature, 1960, 185, 416-422.
[68] Bragg, L. X-ray analysis of the haemoglobin molecule. Proc. R. Soc. Lond B. Biol. Sci., 1953, 141, 67-69.
[69] Perutz, R.R.; Liquori, A.M. and Eirich, F. X-ray and solubility studies of the haemoglobin of sickle-cell anaemia patients. Nature, 1951, 167, 929-931.
[70] Dodson, E. and Dodson, G. Movements at the haemoglobin ahaems and their role in ligand binding, analysed by x-ray crystallography. Biopolymers, 2009, 91, 1056-1063.
[71] Monod, J.; Wyman, J. and Changeux, J.P. On the nature of allosteric transitions: a plausible model. J. Mol. Biol., 1965, 12, 88-118.
[72] Claverie, J.M. and Sauvaget, I. A new protein sequence data bank. Nature, 1985, 318, 19.
[73] Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.; Meyer, E.F.Jr.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T. and Tasumi, M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 1977, 112, 535-542.
[74] Estrada, E. Virtual identification of essential proteins within the protein interaction network of yeast. Proteomics, 2006, 6, 35-40.
[75] Krishnan, A.; Giuliani, A.; Zbilut, J.P. and Tomita, M. Implications from a network-based topological analysis of ubiquitin unfolding simulations. PLoS ONE, 2008, 3, e2149.

Received: June 16, 2009
Revised: September 01, 2009
Accepted: September 17, 2009