Draft of Nov. 6, 2013. Published chapter: dx.doi.org/10.1007/978-1-4614-9521-5_9
Chapter 9
Using Component Trees to Explore Biological Structures
Lucas M. Oliveira, T. Yung Kong and Gabor T. Herman
Abstract An understanding of the three-dimensional structure of a macromolecular complex is essential to fully understand its function. This chapter introduces the reader to the concept of a component tree, which is a compact representation of the structural properties of a multidimensional image (such as a molecular density map of a biological specimen), and then presents ongoing research on the use of such component trees in interactive tools for exploring biological structures. Component trees capture essential structural information about a biological specimen, irrespective of the process that was used to obtain an image of the specimen and the resolution of that image. We present various scenarios in which component trees can help in the exploration of the structure of a macromolecular complex. In addition, we discuss ideas for a docking methodology that uses component trees.
Lucas M. Oliveira Computer Science Ph.D. Program, Graduate Center, City University of New York, 365 Fifth Avenue, New York, NY 10016, USA, e-mail: [email protected]
T. Yung Kong Computer Science Department, Queens College, City University of New York, 65-30 Kissena Boulevard, Flushing, NY 11367, USA, e-mail: [email protected]
Gabor T. Herman Computer Science Ph.D. Program, Graduate Center, City University of New York, 365 Fifth Avenue, New York, NY 10016, USA, e-mail: [email protected]
9.1 Introduction
Three-dimensional (3D) structural studies of biological matter are of great importance for full understanding of the function and evolution of macromolecular complexes and organelles within cells. For example, the 3D structure of a cellular component is closely related to its function within a cell, and knowledge of both structure and function is necessary for applications such as the design of drugs whose targets are particular proteins. Several complementary techniques have been developed for determining the 3D structure of biological specimens. These techniques reveal different details of the macromolecular structure. X-ray crystallography, for example, is a method of determining the arrangement of atoms within a crystal (by striking the crystal with a beam of X-rays and analyzing the resulting diffraction patterns). X-ray crystallography produces high-resolution information and an atomic model of an imaged biological specimen. The atomic model is useful for revealing important structural details of macromolecular subunits and assigning functional properties to macromolecular assemblies. Such atomic models are often made publicly available by depositing them in the Protein Data Bank (PDB) [2]. However, a large number of macromolecules cannot be imaged by X-ray crystallography because they diffract poorly or cannot be crystallized. In such cases, 3D transmission electron microscopy (TEM) techniques can be chosen to investigate the structure. In contrast with X-ray crystallography, which determines the electron density distribution of a sample, electron microscopy yields a representation of molecular densities (through projections of the Coulomb potential) for a specimen that has been imaged [17]. Density maps (which are 3D arrays of real numbers) are produced from observations of elastic interactions of electrons with the atomic composition of the sample. Such density maps are often made publicly available by depositing them in the Electron Microscopy Data Bank (EMDB) [14].
Cryo-electron microscopy (cryo-EM), sometimes called electron cryomicroscopy, is a form of TEM where the sample is studied at cryogenic temperatures (generally liquid nitrogen temperature) to preserve the native environment of the specimen. Cryo-EM has proved to be indispensable for producing reliable images of intact biological structures [31]. Single-particle cryo-EM reconstruction, which determines the structure of a macromolecular complex from projection images, has become an essential technique in structural biology and is being used to determine structures of large macromolecules, macromolecular complexes and cell components involved in many biological processes, including signal transduction, genome replication, transcription and viral infection. Figure 9.1a shows a surface rendering from a density map obtained by single-particle reconstruction of bacterial chaperonin GroEL (EMDB access code 1080; the claimed resolution is 11.5 Å). It is also quite common to create density maps from the atomic models that are provided by X-ray crystallography or NMR spectroscopy. Here the density map is constructed by combining the contribution of every atom in the model (based on the types and locations of the atoms). Because such maps are needed in a variety of research projects, several software packages are available for converting atomic models into density maps; for example, BSoft [10], UCSF Chimera [18], Situs [32] and Xmipp [23]. In the context of this chapter, such conversion is necessary, because the various techniques that are discussed are all for density maps rather than for atomic models. Figure 9.1b shows a surface rendering of a density map created from an atomic model of wildtype apo GroEL (PDB ID: 2NWC [13]; the claimed resolution is 3.02 Å).
Several methods are commonly used to visualize biological structures and produce useful information for understanding their function and evolution. Many of these methods are based on visual representations of 3D density maps that can be interactively explored. Surface renderings, slices, and volume renderings are the most commonly used kinds of visual representation. In this chapter we describe a topological/geometric image descriptor called a component tree, and show how visualization tools based on these trees can be useful in exploring macromolecular structures.
Fig. 9.1: Surface renderings of GroEL. a Density map obtained by single-particle reconstruction of bacterial chaperonin GroEL (EMDB access code 1080; the claimed resolution is 11.5 Å). b Density map created from an atomic model of wildtype apo GroEL (PDB ID: 2NWC [13]; the claimed resolution is 3.02 Å) using the program MolMap in the software package UCSF Chimera [18]
9.2 Component Trees
9.2.1 Digital Pictures
A digital picture consists of (i) a set of elements called picture elements, (ii) an assignment of a nonnegative intensity level (or density level or graylevel) to each picture element, and (iii) an adjacency relation on the set of picture elements. For example, in a 3D structural study we typically cover a region of space that contains the structure of interest with a contiguous array of cubes, customarily called voxels (short for volume elements); these are the picture elements for this particular case. Two (distinct) voxels are considered to be adjacent if they share one whole face (this is commonly referred to as face-adjacency [7]; unless otherwise stated, it is the adjacency that is assumed in this chapter). The assignment of a density to each voxel is achieved by some imaging process, such as single-particle reconstruction. The resulting 3D array of voxel densities is what we have been referring to until now as a density map; often it is also referred to as a (digital) image (footnote 1). The distance between the centers of adjacent voxels is called the voxel spacing; it is the same as the length of any of the edges of the cubic voxels. In deciding the voxel spacing to be used in representing a structure as a digital picture, one should take into consideration the physical resolution of the imaging process by which we obtain the densities to be assigned to the voxels.
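As a rough illustration of the three ingredients of a digital picture, the following Python sketch stores voxel densities in a dictionary keyed by integer coordinates and derives face-adjacency from those coordinates. The names `FACE_OFFSETS`, `face_neighbors` and `picture` are hypothetical, and the 2 × 2 × 2 array of densities is made up for the example.

```python
from itertools import product

# The six unit offsets that lead from a voxel to a face-adjacent voxel.
FACE_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def face_neighbors(voxel, picture):
    """Return the voxels of `picture` that share a whole face with `voxel`."""
    x, y, z = voxel
    candidates = ((x + dx, y + dy, z + dz) for dx, dy, dz in FACE_OFFSETS)
    return [c for c in candidates if c in picture]

# (i) picture elements and (ii) their nonnegative densities (illustrative values only).
picture = {v: float(sum(v)) for v in product(range(2), repeat=3)}

# (iii) the adjacency relation is implied by the integer coordinates.
print(sorted(face_neighbors((0, 0, 0), picture)))
```

A corner voxel of this small array has three face-adjacent neighbors; an interior voxel of a larger array would have six.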
9.2.2 Overview of Component Trees
A component tree is a descriptor which manifests structural relationships between different parts of an image. Appropriately simplified component trees capture essential structural information about a biological specimen in a way that is independent of the resolution of its density map and the process (e.g., cryo-EM or X-ray crystallography) used to obtain that map. They are compact descriptors which can represent in several thousand bits a 3D density map that consists of billions of bits. In the language of discrete mathematics, a component tree is a rooted tree (footnote 2) in which each node is labeled with a “level” that is related (in a manner made precise later in this chapter) to the densities in the density map that is being represented by the component tree (footnote 3). Figure 9.2 shows a surface rendering, a central slice, and a simplified component tree of microtubule binding patterns of dimeric kinesins [11]. The component tree captures the essential structure of the macromolecular assembly. There are fifteen subtrees like the one indicated by the red oval in Figure 9.2d, each of which corresponds to one of the fifteen vertical sections of the microtubule and the kinesins attached to that section. (For example, the subtree indicated by the red oval corresponds to the part of Figure 9.2e that is within the red mesh.) The four leaves (footnote 4) of the tree that are indicated by the purple, yellow, blue, and green arrows in Figure 9.2d respectively correspond to the purple, yellow, blue, and green kinesins in Figure 9.2e (footnote 5). It is clear from Fig. 9.2 that component trees translate a complex 3D structure into a much simpler structure.
As will be demonstrated in this chapter, a component tree is an image descriptor that can be useful for exploring biological structures in several different ways. For example, given a density map and its component tree, a user can produce several density map segmentations by interactive selection of specific parts of the component tree. These segmentations can be useful in understanding the relationship between different subunits in a macromolecule or even the function of a specific part of a macromolecule.
Another potential application of component trees is in macromolecular docking (also called macromolecular fitting). Suppose that we have a low-resolution density map for a macromolecule and a high-resolution density map for a subunit of this macromolecule. Since component trees capture the essence of the macromolecular structures represented in the density maps, the component tree of the high-resolution density map should have approximately the same tree structure as that part of the component tree of the whole macromolecule which corresponds to the subunit. In this context our goal is to develop algorithms that will automatically find the relevant part of the latter component tree.
1 In this chapter we distinguish between digital images and digital pictures. The difference between the two is that a digital picture has an adjacency relation but a digital image does not: A digital picture can be regarded as the result of equipping a digital image (which we may refer to as its underlying image) with an adjacency relation on the picture elements. Digital pictures which have the same underlying image but have different adjacency relations are considered to be different digital pictures. For example, in the case when the picture elements are voxels, we could have chosen the adjacency relation to be the face-or-edge-adjacency that exists between two voxels if they share either exactly one face or exactly one edge [7]. The component tree of a digital picture will depend on its adjacency relation as well as its underlying image.
2 A rooted tree T is a pair (N, E), where N is a finite set of nodes and E is a set of edges. Each edge is an ordered pair of distinct nodes that are respectively called the parent node and the child node of the edge, and the nodes and edges satisfy the following conditions: (1) every member of N, except one element called the root, is a child node of just one edge; (2) the reflexive transitive closure of E is a partial order on N. If x and y are nodes such that x = y or x precedes y in the partial order, then x is called an ancestor of y and y is called a descendant of x. In particular, every node in N is a descendant of the root. We say x is a proper ancestor (respectively, proper descendant) of y if x is an ancestor (respectively, a descendant) of y and x ≠ y.
3 Component trees are very similar to the foreground history trees of [22] and the join trees of [3]. They are also related to contour trees [3]; the relationship between contour and component trees is discussed in Sec. 2.7.1 of [8].
4 A node of a rooted tree is called a leaf if it has no children.
5 In Figure 9.2c and d, each node that has two or more children (such as the root of the subtree that is highlighted in panel d) is represented by a horizontal segment, and an edge from a node to one of its children is represented by a vertical segment whose length is proportional to the difference between the levels (see later for definition) of those two nodes.
Fig. 9.2: Component Tree of a Macromolecular Structure. a Surface rendering, b Central slice, and c Simplified component tree of a density map of the microtubule binding pattern of dimeric kinesins (density map with EMDB code 1032). A subtree of the tree in c is highlighted in d; this subtree corresponds to the part of the surface rendering that is inside the red mesh in e. The four leaves of the subtree that are indicated by the colored arrows in d correspond to four colored kinesins in e.
9.2.3 Component Tree of a 1-Dimensional Digital Picture
As a first and very simple illustration of how a component tree can be created from a digital picture, we explain the construction of the component tree of the 1-dimensional digital picture shown in Fig. 9.3. This digital picture contains just a single row of 37 pixels (this is the commonly used name for square-shaped picture elements [7]); the intensity of each pixel is indicated by the number above that pixel. For example, the intensities of the four leftmost pixels are respectively 0, 3, 14, and 14. The adjacency relation of the digital picture is edge adjacency: two pixels are considered to be adjacent if they share an edge. Thus each pixel except the first and the last is adjacent to just two pixels: one on its left and one on its right. The first and the last pixels are each adjacent to just one pixel.
Fig. 9.3: A Simple Digital Picture. This 1-dimensional digital picture will be used to illustrate the construction of a component tree. The picture contains 37 pixels, each of which is labeled with its intensity. Two pixels are considered to be adjacent if they share an edge
The component tree of this digital picture is shown in Fig. 9.4a. Note that the digital picture is reproduced at the top of Fig. 9.4a; this is to make it easier to verify certain relationships between tree and picture that we will state below. Each node of the component tree is a set of picture elements of the digital picture; the cardinality of a node is the number of picture elements in that set: For example, we see from Fig. 9.4a that the node v0 has cardinality 37—it is just the set of all 37 pixels. We also see that the node v1 has cardinality 36 (as it is the set of all pixels other than the leftmost pixel), and that the node v20 has cardinality two (as it consists of the 2nd and the 3rd pixels from the right). However, we often draw component trees more simply, by showing each node as a point rather than a set of picture elements. Figure 9.4b shows the same component tree in this simplified way. Every node of the tree has a level; the level of any node v is defined to be the minimum of the intensity levels of the picture elements in v. In Fig. 9.4a, the levels of the nodes are indicated by the numbers beside the vertical bar on the left. For example, we see at once that the level of the node v9 is 10. (It is also easy to verify that this is correct: The node v9 consists of eight pixels whose intensity levels are 14, 14, 12, 14, 14, 10, 12, and 12, and the minimum of these intensity levels is indeed 10.) We now describe an easy way to create the component tree. This will involve thresholding the digital picture at every intensity level t that occurs in the picture. (For any intensity level t, we threshold a digital picture at the level t by omitting all the picture elements whose intensity is < t. In other words, we retain only the set of picture elements whose intensity is ≥ t. 
Each maximal connected fragment (i.e., a connected fragment that is not included in a larger connected fragment) of the latter set is called a connected component or just a component of the set; if the set is disconnected, then it will consist of two or more connected components. For example, if we threshold the picture of Fig. 9.3 at level t = 16 then just six pixels will be retained, namely the 2nd, 3rd, 5th, 6th, 8th, and 9th pixels from the right, and those six pixels will fall into three components each of which consists of just two adjacent pixels.) The method of constructing the component tree that we will describe can be understood as a natural and direct way to produce a rooted tree which has the following properties:
1. For every intensity level t, each connected component of the set of picture elements that have intensity ≥ t is a tree node of level ≥ t. All nodes of the tree can be obtained in this way. The root of the tree is the set consisting of every picture element of the picture (regardless of the picture element’s intensity level).
2. A node u is a proper ancestor of a node v just if (footnote 6) u is a proper superset of v.
The term proper ancestor in property 2 was defined in footnote 2 for any rooted tree, but can be defined more informally as follows: A node x is an ancestor of a node y if x lies on the simple path from the tree’s root to y; any ancestor of y other than y itself is called a proper ancestor of y. For example, the proper ancestors of the node v23 in Fig. 9.4 are v20, v15, v5, v3, v1, and the root v0.
Fig. 9.4: Detailed and Simplified Representations of a Component Tree. a Component tree of the digital picture of Fig. 9.3, shown in full detail: for each node, the node’s level and the pixels which constitute that node are shown. b Simplified drawing of the same tree in which each node is shown just as a point. (Panel a is reproduced from [8])
As mentioned above, the direct way of constructing a component tree that will be described below involves thresholding the digital picture at every intensity level which occurs in the picture. Another way to construct component trees is presented in [16]. The algorithm of [16] does not involve thresholding, and is computationally more efficient when applied to digital pictures that have many intensity levels. It processes the picture elements in decreasing order of their intensity and uses Tarjan’s union-find algorithm [28] to build the tree from the bottom up.
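The following Python sketch is not the algorithm of [16]. Under simplifying assumptions, it only illustrates how a union-find structure can track components while the pixels of a 1-dimensional picture are added in decreasing order of intensity, and it reports the number of components at each threshold rather than building the tree. The function name `components_per_level` is hypothetical; the sample intensities are the eight-pixel run 18, 18, 13, 18, 18, 12, 16, 18 that the chapter identifies as node v15.

```python
def components_per_level(intensities):
    """Count the components of {pixels with intensity >= t} for each level t.

    Pixels are activated in decreasing order of intensity; union-find merges
    an activated pixel with any already active left or right neighbor.
    """
    n = len(intensities)
    parent = list(range(n))

    def find(i):
        # Find the representative of i, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    active = [False] * n
    order = sorted(range(n), key=lambda i: -intensities[i])
    counts, count, pos = {}, 0, 0
    for t in sorted(set(intensities), reverse=True):
        while pos < n and intensities[order[pos]] >= t:
            i = order[pos]
            pos += 1
            active[i] = True
            count += 1                      # a new singleton component
            for j in (i - 1, i + 1):
                if 0 <= j < n and active[j]:
                    ri, rj = find(i), find(j)
                    if ri != rj:
                        parent[ri] = rj     # merge two components
                        count -= 1
        counts[t] = count
    return counts

print(components_per_level([18, 18, 13, 18, 18, 12, 16, 18]))
```

For this run the count at threshold 16 is three, in agreement with the three two-pixel components described in the text for t = 16.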
How We Can Find the Nodes of the Component Tree
Let I be the digital picture, and let ℓ be the set of all the intensity levels that occur in I. The nodes of the component tree of the digital picture I can be found by thresholding I at each of the graylevels in ℓ. At each threshold level t ∈ ℓ, we find the picture elements whose intensity levels are ≥ t and then find the connected components of that set of picture elements. Each such connected component is one node of the component tree. As stated above, we define the level of that node to be the minimum of the intensity levels of the picture elements in the component. Every node of the component tree can be obtained in this way. However, the level of a node that is found when I is thresholded at intensity level t need not be t: Such a node may have level t′ > t, in which case that very same node will also be found when I is thresholded at any other intensity level between t and t′.
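The thresholding procedure just described can be sketched in a few lines of Python for a 1-dimensional picture. This is a minimal sketch, not the authors' implementation: the function names `components_1d` and `tree_nodes` are hypothetical, nodes are represented as frozensets of 0-based pixel indices (a convention chosen for this example), and the sample data are the eight pixels 18, 18, 13, 18, 18, 12, 16, 18 that the chapter lists for node v15, treated here as a picture in their own right.

```python
def components_1d(intensities, t):
    """Connected components (frozensets of indices) of the pixels with intensity >= t."""
    comps, run = [], []
    for i, value in enumerate(intensities):
        if value >= t:
            run.append(i)                 # extend the current run of retained pixels
        elif run:
            comps.append(frozenset(run))  # a pixel below t ends the run
            run = []
    if run:
        comps.append(frozenset(run))
    return comps

def tree_nodes(intensities):
    """Map every node (a frozenset of pixel indices) to its level."""
    nodes = {}
    for t in sorted(set(intensities)):    # threshold at every level that occurs
        for comp in components_1d(intensities, t):
            # The level is the minimum intensity in the node; a node that is
            # found at several thresholds is stored only once.
            nodes[comp] = min(intensities[i] for i in comp)
    return nodes

run_v15 = [18, 18, 13, 18, 18, 12, 16, 18]
nodes = tree_nodes(run_v15)
for node, level in sorted(nodes.items(), key=lambda kv: (kv[1], sorted(kv[0]))):
    print(level, sorted(node))
```

Using a dictionary keyed by the pixel set makes it explicit that a component found at thresholds t and t′ is one and the same node.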
6 In this chapter “just if” is used with its precise mathematical meaning: for any two statements P and Q, the statement “P just if Q” means “P is true if, and only if, Q is true”.
Examples of How Nodes are Found
Consider in Fig. 9.3 the threshold level t = 0. In this case all 37 pixels of I have intensity ≥ t = 0, so the set of pixels with intensity ≥ t = 0 has just one connected component (namely the entire set of 37 pixels). Thus thresholding the picture I at level t = 0 yields just one node of the tree. This node is shown as v0 in Fig. 9.4a. The node’s level (i.e., the minimum of the intensities of its pixels) is 0 because the intensity of the leftmost pixel of I is 0. This node is the root node of the component tree. Now let us consider the threshold level t = 1. In this case all but one of the 37 pixels of I have intensity levels ≥ t = 1; the only exception is the leftmost pixel, whose intensity is 0. This set of 36 pixels also has only one component. Thus thresholding the picture I at level 1 yields just one node of the tree. This node (of cardinality 36) is shown as v1 in Fig. 9.4a. The minimum of the intensities of the pixels in this node is 1 (because the intensity of the 18th pixel is 1), so this node has level 1. Thresholding the picture I at the next level in ℓ, namely the level t = 3, yields two nodes of the tree that have cardinalities 16 and 19. This is because the set of pixels with intensity ≥ 3 consists of the two components labeled v2 and v3 in Fig. 9.4a, which are separated by a pixel whose intensity is 1. In each of the two components the pixel of lowest intensity has intensity 3, so each of the two nodes has level 3. The next threshold level in ℓ is t = 6. The reader should now have no difficulty in verifying that thresholding I at level t = 6 yields just two nodes of the tree, both of which have level 6. These nodes, which have cardinalities 15 and 18, are labeled v4 and v5 in Fig. 9.4a. For the threshold levels t that have been considered so far, the component tree nodes that are found when we threshold I at intensity level t have also had level t. But this is not true when we use the threshold level t = 7. The threshold level t = 7 will yield five component tree nodes because the set of pixels with intensity ≥ 7 consists of five components. But only one of these five nodes will have a level that is equal to t; the levels of the other four nodes will be higher. Indeed, the leftmost component of the set of pixels with intensity ≥ 7, labeled v9 in Fig. 9.4a, consists of eight pixels with intensities 14, 14, 12, 14, 14, 10, 12, and 12; this will therefore be a component tree node whose level is 10. Another component, labeled v7, consists of five pixels with intensities 14, 14, 8, 9, and 10; this will therefore be a component tree node whose level is 8. A third component, labeled v6, consists of four pixels with intensities 7, 7, 12, and 12; this will be a node whose level is 7.
A fourth component, labeled v14 , consists of two pixels that both have intensity 12; this will be a node whose level is 12. The fifth component, labeled v15 , consists of eight pixels with intensities 18, 18, 13, 18, 18, 12, 16, and 18; this too will be a node whose level is 12.
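The level computation for the five components found at t = 7 can be checked directly. The short Python sketch below takes the pixel intensities exactly as listed in the text; the dictionary name `components_at_7` is a hypothetical choice for this example.

```python
# Pixel intensities of the five components of {intensity >= 7}, as listed in the text.
components_at_7 = {
    "v9": [14, 14, 12, 14, 14, 10, 12, 12],
    "v7": [14, 14, 8, 9, 10],
    "v6": [7, 7, 12, 12],
    "v14": [12, 12],
    "v15": [18, 18, 13, 18, 18, 12, 16, 18],
}

# The level of a node is the minimum intensity among its pixels.
levels = {name: min(values) for name, values in components_at_7.items()}
print(levels)

# Only the nodes whose level equals the threshold are "new" at t = 7.
print([name for name, level in levels.items() if level == 7])
```

Only v6 has level 7; the other four nodes have higher levels and are therefore found again at higher thresholds.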
How We Can Find the Edges of the Component Tree
The edges of the component tree connect nodes at different levels in a way that reflects the inclusion relationships between nodes. Specifically, there is an edge from a node v to a node u of higher level just if v is the node of highest level such that v ⊋ u (i.e., just if the set of picture elements v strictly contains the set of picture elements u and there is no node of higher level than v that strictly contains u). The reader can easily verify that the edges shown in Fig. 9.4a are exactly the edges given by this rule. For example, we see from Fig. 9.4a that v20 is the node of highest level which strictly contains the node v23, and so there is an edge from v20 to v23 in the tree.
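The edge rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: nodes are frozensets of 0-based pixel indices mapped to their levels, the function name `parent_of` is hypothetical, and the six nodes are those obtained by thresholding the eight-pixel run 18, 18, 13, 18, 18, 12, 16, 18 (the pixels the chapter lists for v15) at every level that occurs in it.

```python
# Nodes of the eight-pixel example, keyed by pixel set, with their levels.
nodes = {
    frozenset(range(8)): 12,
    frozenset(range(5)): 13,
    frozenset({6, 7}): 16,
    frozenset({0, 1}): 18,
    frozenset({3, 4}): 18,
    frozenset({7}): 18,
}

def parent_of(node, nodes):
    """Return the node of highest level that strictly contains `node` (None for the root)."""
    supersets = [v for v in nodes if v > node]   # '>' on frozensets means proper superset
    return max(supersets, key=lambda v: nodes[v]) if supersets else None

# One edge (parent, child) for every node other than the root.
edges = [(parent_of(u, nodes), u) for u in nodes if parent_of(u, nodes) is not None]
for v, u in edges:
    print(sorted(v), "->", sorted(u))
```

The nodes that strictly contain a given node are nested inside one another and have distinct levels, so the maximum taken in `parent_of` is unique.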
Fig. 9.5: Creation of the 2D Digital Picture of Sec. 9.2.4. a Surface rendering of a three-dimensional reconstruction (from cryo-electron microscopy images) of helicase DnaB (EMDB access code 1022). b A central slice of the density map. c The reduced version of b that is used in Sec. 9.2.4: In c, each pixel corresponds to a 5 × 5 region of the image in b and pixel intensity levels have been quantized to five values 0, 1, 2, 3, and 4 (which are respectively shown as black, dark gray, gray, light gray, and white). In Sec. 9.2.4, c is regarded as a digital picture in which the adjacency relation is edge adjacency, so that each pixel is adjacent just to those other pixels which share an edge with it
9.2.4 Component Tree of a 2-Dimensional Digital Picture
The above process of creating a component tree is valid for digital pictures of any dimension. To give an additional example of how component trees are constructed, this time using an image that is derived from a real biological structure, we describe the creation of a component tree of the digital picture that is shown in Fig. 9.5c. The digital picture in Fig. 9.5c is a simplified version of a central slice of a three-dimensional reconstruction (from cryo-electron microscopy images) of helicase DnaB (EMDB access code 1022). Figure 9.5a shows a surface rendering of the three-dimensional density map. Figure 9.5b shows a cropped part of the original central slice, which contains 50 × 50 pixels. This was simplified to Fig. 9.5c, which contains 10 × 10 pixels, by replacing 5 × 5 arrays of pixels with single pixels whose intensities were obtained by averaging the intensities of the pixels in the corresponding arrays, and by quantizing pixel intensity levels to a set of five equally spaced values represented by the integers 0, . . . , 4. In Fig. 9.5c, the intensity levels 0, 1, 2, 3, and 4 are respectively shown as black, dark gray, gray, light gray, and white. We regard Fig. 9.5c as a digital picture in which the adjacency relation within the set of pixels is edge adjacency: Distinct pixels are considered to be adjacent just if they share an edge.
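The reduction from Fig. 9.5b to Fig. 9.5c combines block averaging with quantization. The Python sketch below shows one way such a reduction might be done; the chapter does not state the exact quantization rule, so the linear mapping between the smallest and largest averaged values used in `quantize` is an assumption, as are the function names and the small 4 × 4 test image.

```python
def block_average(image, b):
    """Average non-overlapping b x b blocks of a 2D list whose sides are multiples of b."""
    rows, cols = len(image), len(image[0])
    return [
        [sum(image[r + i][c + j] for i in range(b) for j in range(b)) / (b * b)
         for c in range(0, cols, b)]
        for r in range(0, rows, b)
    ]

def quantize(image, n_levels):
    """Map values linearly onto the integers 0 .. n_levels - 1 (an assumed rule)."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    if hi == lo:
        return [[0 for _ in row] for row in image]
    scale = (n_levels - 1) / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in image]

# Made-up 4 x 4 image, reduced with 2 x 2 blocks and five intensity levels.
image = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 2, 2],
    [8, 8, 2, 2],
]
reduced = quantize(block_average(image, 2), 5)
print(reduced)
```

For the picture of Fig. 9.5c the corresponding call would use 5 × 5 blocks on a 50 × 50 slice, giving a 10 × 10 picture with levels 0 to 4.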
Fig. 9.6: Components of the Digital Picture of Fig. 9.5c at Different Threshold Levels. a Component tree of the 2D digital picture presented in Fig. 9.5c, shown using the simplified representation in which each node appears as a point rather than as a set of pixels. b, c, d, and e The components at threshold levels 1, 2, 3, and 4, respectively; in each case the cross-hatched parts of the image consist of pixels that do not belong to any component because their intensities are below the threshold level. Each component shown in b, c, d, and e is a node of the tree a: Tree node v1 consists of the 98 pixels that are not cross-hatched in b, and tree nodes v2 , . . . , v8 are the correspondingly labeled components in c, d, and e
The component tree of the digital picture of Fig. 9.5c is shown in Fig. 9.6a using the simplified representation in which each node is shown as a point rather than as a set of pixels. We will now describe a construction of this tree. When the picture is thresholded at the lowest intensity level t = 0, there is just one component, which consists of all 100 pixels in the image since all pixels have intensity ≥ t = 0. Thus thresholding the picture at level t = 0 yields just this one node, which is the root v0 of the component tree in Fig. 9.6a. The node’s level (i.e., the minimum of the intensities of its pixels) is 0. When the picture is thresholded at the intensity level t = 1, there is again just one component, because the two pixels of the image that have intensity less than 1 (the two black pixels in Fig. 9.5c, which are cross-hatched in Fig. 9.6b) do not separate v0. This component is the tree node v1 in Fig. 9.6a; its cardinality is 98, and its level is 1 because it does contain pixels whose intensity is 1. When the picture is thresholded at the intensity level t = 2, all pixels in the cross-hatched parts of Fig. 9.6c have intensity levels that are below the threshold and are therefore omitted. The remaining pixels consist of two components: As we see from Fig. 9.6c, one component consists of a single pixel in the top left of the image (node v2
in the tree) and the second component consists of all the other pixels with intensity ≥ t = 2 (node v3 in the tree). Since the only pixel in v2 has intensity 2, and there are many pixels in v3 that have intensity 2, both of these nodes have level 2. When the picture is thresholded at the intensity level t = 3, there are again two components: All pixels in the cross-hatched parts of Fig. 9.6d have intensity levels that are below the threshold t = 3; the remaining pixels consist of a component v4 of cardinality 5 and a component v5 of cardinality 12. We also see from Fig. 9.6d that each of v4 and v5 contains pixels that have intensity 3—v4 has two such pixels and v5 has three—so each of v4 and v5 is a node of level 3 in the component tree. These level 3 nodes are children of the level 2 node v3 because each of the sets v4 and v5 is contained in the set v3 . When the picture is thresholded at the intensity level 4, there are three components: All pixels in the cross-hatched parts of Fig. 9.6e have intensity levels that are below the threshold, and the remaining pixels consist of a component v6 of cardinality 3, a component v7 of cardinality 4, and a component v8 of cardinality 5. All the pixels in these components have intensity 4, so each of v6 , v7 , and v8 is a node of level 4 in the component tree. From 9.6d and 9.6e we see that v6 is contained in the level 3 node v4 and must therefore be a child of v4 in the tree. We similarly see that v7 and v8 are both contained in the level 3 node v5 and must therefore be children of that node.
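The same thresholding steps can be sketched for a 2-dimensional picture with edge adjacency. In the Python sketch below the function name `components_2d` is hypothetical, and the 4 × 4 picture is made up for the example (it is not the picture of Fig. 9.5c, whose pixel values are not listed in the text).

```python
from collections import deque

def components_2d(picture, t):
    """Connected components (edge adjacency) of the pixels with intensity >= t."""
    rows, cols = len(picture), len(picture[0])
    seen, comps = set(), []
    for r in range(rows):
        for c in range(cols):
            if picture[r][c] < t or (r, c) in seen:
                continue
            # Breadth-first search from an unvisited retained pixel.
            comp, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and picture[ny][nx] >= t:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            comps.append(frozenset(comp))
    return comps

# Hypothetical picture with intensity levels 0..4.
picture = [
    [2, 1, 3, 4],
    [1, 1, 3, 3],
    [0, 1, 2, 2],
    [1, 1, 4, 2],
]
for t in range(5):
    print(t, sorted(len(comp) for comp in components_2d(picture, t)))
```

As in the example of Fig. 9.6, thresholding this small picture at t = 2 separates a single top-left pixel from the rest of the retained pixels.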
9.2.5 A Method of Simplifying Component Trees
Component trees of digital pictures are quite sensitive to noise and other small inaccuracies in the digital pictures, which can significantly deform or otherwise alter the structure of the tree. (For example, component trees of digital pictures produced by cryo-electron microscopy may have many low level nodes which do not represent any biological structure. Some of these nodes represent noise, and others may represent parts of the ice in which the specimen is embedded.) Tree simplification can greatly reduce the effects of these errors on the component tree, while retaining the essential structural information that the original tree contains. Simplification may also eliminate features of the component tree which represent structural information that is of no interest in a given application. There are other important reasons to simplify component trees. Visual exploration of unsimplified component trees may be very difficult because of the large numbers of nodes and edges they contain. In addition, the computational cost of manipulating (and in some cases even storing) representations of these large trees can be very high. Tree simplification may produce a much smaller tree that can be more efficiently manipulated and analyzed, and which is much easier for a user to explore interactively. Several tree simplification methodologies have been proposed over the years [3, 4, 5, 8, 26]. Here we will describe a three step simplification method called (λ, k)-simplification that was proposed in [8] and was shown there to be robust in
the presence of noise and other small inaccuracies in the picture. As we will see in Sec. 9.2.6, there is some evidence that simplified component trees produced by this method can be used to distinguish between similar biological specimens in experimental 3D structures. Figure 9.7 illustrates the effects of the three steps of (λ, k)-simplification: Fig. 9.7a shows the unsimplified component tree of the digital picture that is surface-rendered in Fig. 9.2a. The trees produced by application of just step 1, just steps 1 and 2, and all three steps of a (λ, k)-simplification to this unsimplified tree are shown in Fig. 9.7b, 9.7c, and 9.7d, respectively. Each of the steps removes certain nodes and eliminates certain edges from the tree. In the final tree, no node (with the possible exception of the root) has just one child, the cardinality of every node is greater than the parameter k, and every edge is longer than the parameter λ. Here the length of an edge is defined to be the difference between the levels of the two nodes it connects. The values of λ and k must be carefully chosen. The larger the values of λ and k, the simpler the final tree will be. But if either parameter is too large then (λ, k)-simplification will fail to preserve essential structural information. In the following subsections we describe the steps of (λ, k)-simplification in more detail.
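The three properties of the final tree stated above can be expressed as a simple check. The Python sketch below is illustrative only: the function name `violations`, the dictionary layout of the tree, and the three-node example tree are all assumptions made for this example and do not come from [8].

```python
def violations(tree, root, lam, k):
    """List the ways `tree` breaks the three stated properties of a fully simplified tree.

    `tree` maps a node name to a dict with keys 'level', 'cardinality' and 'children'.
    """
    problems = []
    for name, node in tree.items():
        if name != root and len(node["children"]) == 1:
            problems.append((name, "has just one child"))
        if node["cardinality"] <= k:
            problems.append((name, "cardinality not greater than k"))
        for child in node["children"]:
            # Edge length: difference between the levels of the two nodes.
            length = tree[child]["level"] - node["level"]
            if length <= lam:
                problems.append((name, child, "edge not longer than lambda"))
    return problems

# Hypothetical tree: a root with two leaf children.
tree = {
    "r": {"level": 0, "cardinality": 100, "children": ["a", "b"]},
    "a": {"level": 5, "cardinality": 30, "children": []},
    "b": {"level": 2, "cardinality": 40, "children": []},
}
print(violations(tree, "r", lam=1, k=10))
print(violations(tree, "r", lam=2, k=10))
```

With λ = 1 the example tree satisfies all three properties; raising λ to 2 flags the edge from r to b, whose length is exactly 2.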
Step 1 of (λ, k)-Simplification: Pruning Away Small Components Step 1 of (λ, k)-simplification prunes the tree by removing nodes of small cardinality: it removes all nodes of the component tree that contain fewer than k + 1 elements. If k is suitably chosen, many nodes that result from noise in the image will be removed by this step. Figure 9.8 shows the result of applying this step to the component tree of Fig. 9.4, in the case k = 1. Note that the two components removed in this step, v10 and v23, are lighter colored in Fig. 9.8.
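Step 1 can be sketched in a few lines of Python. The representation used here (a `parent` dictionary and a `card` dictionary) is a hypothetical one chosen for illustration. The sketch relies on the fact that a descendant component is contained in each of its ancestors, so a node that survives the pruning has a parent that also survives.

```python
def prune_small_components(parent, card, k):
    """Sketch of step 1: drop every node with fewer than k + 1 elements.

    parent: node -> parent node (None for the root)   [assumed layout]
    card:   node -> number of picture elements in the node
    Returns the parent and cardinality maps restricted to surviving nodes.
    """
    keep = {n for n in parent if card[n] >= k + 1}
    # Because card(parent) >= card(child), the parent of every kept
    # non-root node is also kept, so the restricted map is still a tree.
    new_parent = {n: parent[n] for n in keep}
    new_card = {n: card[n] for n in keep}
    return new_parent, new_card
```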
Step 2 of (λ, k)-Simplification: Pruning Away Short Branches To describe step 2 of (λ, k)-simplification we introduce the concept of a critical node: Recall that a node in a rooted tree is called a leaf if it has no children. A node is said to be critical if it is a leaf or it has at least two children. (Thus the nodes that are not critical are the nodes that have just one child.) Figure 9.9 is a copy of Fig. 9.4b in which the critical nodes have been circled. The closest critical proper ancestor of a node u is the node of highest level that is both a proper ancestor of u and a critical node. For example, the closest critical proper ancestor of the leaf v13 in Fig. 9.8 or Fig. 9.9 is the node v5. Step 2 of (λ, k)-simplification prunes the tree by removing short branches. This pruning is applied to the result of step 1. Roughly speaking, the effect of this step is to remove all those leaves of the tree for which the difference between the level of the leaf and the level of its closest critical proper ancestor is less than or equal to λ.
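The two definitions just given (critical node and closest critical proper ancestor) translate directly into code. The sketch below assumes a hypothetical `parent` dictionary as the tree representation, and assumes that levels increase from parent to child, so that the first critical node met when walking toward the root is the critical proper ancestor of highest level.

```python
def children_of(parent):
    """Build node -> list of children from node -> parent (None for root)."""
    kids = {n: [] for n in parent}
    for n, p in parent.items():
        if p is not None:
            kids[p].append(n)
    return kids


def is_critical(node, kids):
    """A node is critical if it is a leaf or has at least two children."""
    return len(kids[node]) != 1


def closest_critical_proper_ancestor(node, parent, kids):
    """Walk upward from the parent until a critical node is found.

    Returns None if no proper ancestor of `node` is critical.
    """
    a = parent[node]
    while a is not None and not is_critical(a, kids):
        a = parent[a]
    return a
```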
Fig. 9.7: Simplifying a Component Tree. a Unsimplified component tree of the digital picture that is surface-rendered in Fig. 9.2a. Trees depicted in panels b, c, and d are respectively the results of performing step 1, steps 1 and 2, and all 3 steps of a (λ, k)-simplification on the unsimplified tree a. The final tree d is the same as the tree shown in Fig. 9.2c. The tree representation used here is explained in footnote 5. Nodes that are not the root of the tree but have just one child are not explicitly shown in the tree representation we use here; the trees a, b, and c have some such nodes
For a precise specification of step 2 and a suggested implementation of this step, we refer the reader to Sec. 2.4 of [8]. (In the example shown in Fig. 9.7, many of the nodes removed by this step represent components of the ice in which the specimen under study was embedded.) The tree shown in Fig. 9.10 was produced by applying step 2 to the tree in Fig. 9.8, assuming λ = 2. The three leaves v8, v12, and v18 are removed, though it would also have been correct to remove v17 instead of v18. The leaf v8 is removed because the level of v8 is 9, the level of the closest critical proper ancestor of v8 (which is v7) is 8, and so the difference between these levels is just 1, which is less than or equal to λ = 2. Similarly, the leaf v12 is removed because the difference
Fig. 9.8: Pruning Away Small Components. The effect of step 1 of (λ, k)-simplification (removal of nodes of size ≤ k) on the component tree of Fig. 9.4 is shown, in the case k = 1. Only two nodes (v10 and v23) are removed from the tree of Fig. 9.4, as all other nodes of the tree consist of more than k pixels (i.e., more than 1 pixel, since k = 1). (Reproduced from [8])
Fig. 9.9: Illustration of the Concept of a Critical Node. A critical node in a rooted tree is a node that either has no children or has at least two children; these nodes are circled in the above tree. (Reproduced from [8])
between its level and the level of its closest critical proper ancestor v9 is only 2. Note, however, that just one of the two leaves v17 and v18 is removed, despite the fact that the difference in level between each of v17 and v18 and its closest critical proper ancestor v11 is only 2. This reflects the fact that step 2, as specified in [8], considers leaves for possible removal one at a time (in increasing order of level): after the removal of v12 and one of the leaves v17 and v18, the closest critical proper ancestor of the other of v17 and v18 is v4 (since the nodes v9 and v11 are no longer critical) and, since the level of v4 differs from the level of v17 and v18 by more than λ = 2, the remaining one of the leaves v17 and v18 is not removed.
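The one-leaf-at-a-time behavior described above can be illustrated with a short Python sketch. This is not the procedure of Sec. 2.4 of [8], whose precise specification is not reproduced in this chapter; it is a reading of the rough description given here. In particular, the sketch assumes that removing a leaf also removes the non-critical nodes between that leaf and its closest critical proper ancestor, and that a leaf with no critical proper ancestor is kept. The `parent` and `level` dictionaries are a hypothetical representation.

```python
def prune_short_branches(parent, level, lam):
    """Sketch of step 2 (assumptions stated in the accompanying text).

    parent: node -> parent node (None for the root)
    level:  node -> level of the node
    Returns a new parent map with short branches removed.
    """
    parent = dict(parent)  # work on a copy
    n_children = {n: 0 for n in parent}
    for p in parent.values():
        if p is not None:
            n_children[p] += 1

    # Consider leaves one at a time, in increasing order of level.
    leaves = sorted((n for n in parent if n_children[n] == 0),
                    key=lambda n: level[n])
    for leaf in leaves:
        # Collect the leaf and the non-critical nodes directly above it.
        branch = [leaf]
        a = parent[leaf]
        while a is not None and n_children[a] == 1:
            branch.append(a)
            a = parent[a]
        if a is None:
            continue  # no critical proper ancestor: keep this leaf
        if level[leaf] - level[a] <= lam:
            for n in branch:
                del parent[n]
                del n_children[n]
            # The ancestor loses a child and may stop being critical,
            # which affects the leaves considered afterwards.
            n_children[a] -= 1
    return parent
```

In the small test tree used to exercise this sketch, two sibling leaves both lie within λ of their parent, yet only one of them is removed, mirroring the v17/v18 situation discussed above.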
Fig. 9.10: Pruning Away Short Branches. The effect of simplifying the component tree of Fig. 9.8 by removing branches of length ≤ λ = 2. The nodes v8, v12, and v18 are removed from the component tree, though it would also have been correct to remove v17 instead of v18. (Reproduced from [8])
Step 3 of (λ, k)-Simplification: Elimination of Non-Critical Nodes and Short Internal Edges The result of step 2 is used as the input tree for step 3, which is the last step of (λ, k)-simplification. Roughly speaking, the effects of step 3 are to remove all non-critical nodes (with the exception of the root, which is not removed even if it is non-critical), and to also remove those critical nodes for which the difference in level between that critical node and its closest critical proper ancestor is ≤ λ. The nodes that remain are the nodes of the final simplified tree; a node u of the final tree is an ancestor in that tree of a node v just if u was an ancestor of v in the original unsimplified tree. We refer the reader to Sec. 2.5 of [8] for a precise specification of this step and a suggested way to implement it. Figure 9.11 shows the result of applying step 3 to the tree in Fig. 9.10, assuming λ = 2. The critical node v16 of the latter tree (the parent of v21 and v22) is removed, because the difference between the levels of v16 and its closest critical proper ancestor v15 is just 1, which is ≤ λ = 2. Removal of v16 causes v15 to become the parent of v21 and v22. All non-critical nodes (other than the root v0) are also removed.
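The rough description of step 3 can likewise be sketched in Python. Again, this is not the procedure of Sec. 2.5 of [8]. The sketch assumes that criticality and closest critical proper ancestors are evaluated once, on the input tree, and that a critical node with no critical proper ancestor is kept. The `parent` and `level` dictionaries are a hypothetical representation.

```python
def eliminate_noncritical_and_short_edges(parent, level, lam):
    """Sketch of step 3 (assumptions stated in the accompanying text).

    Returns node -> parent for the final tree; the parent of a kept node
    is its closest kept proper ancestor in the input tree.
    """
    n_children = {n: 0 for n in parent}
    for p in parent.values():
        if p is not None:
            n_children[p] += 1

    def critical(n):
        return n_children[n] != 1

    keep = set()
    for n in parent:
        if parent[n] is None:
            keep.add(n)  # the root is never removed
            continue
        if not critical(n):
            continue  # non-critical nodes are removed
        a = parent[n]
        while a is not None and not critical(a):
            a = parent[a]
        # Keep a critical node unless it is within lambda of its
        # closest critical proper ancestor.
        if a is None or level[n] - level[a] > lam:
            keep.add(n)

    new_parent = {}
    for n in keep:
        a = parent[n]
        while a is not None and a not in keep:
            a = parent[a]
        new_parent[n] = a
    return new_parent
```

The test tree used with this sketch imitates the v15/v16 example: an internal critical node one level below its closest critical proper ancestor is removed, and its children are reattached to that ancestor.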
9.2.6 Demonstration of Potential Biological Applicability The experiment reported in Sec. 2.6 of [8] is a good example of how the simplification method described above can produce a simple and compact tree representation that captures the essential structure of a digital picture. In this example, two very similar macromolecules are differentiated by comparing their simplified component
Fig. 9.11: Elimination of Non-Critical Nodes and Short Internal Edges. Result of applying step 3 of (λ , k)-simplification to the tree of Fig. 9.10, assuming λ = 2. The nodes and edges of the resulting tree are shown as thick black nodes and edges. (Reproduced from [8])
Fig. 9.12: Two Different Versions of Adenovirus. a surface rendering and b central cross-section of a wildtype adenovirus. c surface rendering and d central cross-section of a mutant adenovirus. (Reproduced from [8])
trees. These two macromolecules, a mutant and a wildtype version of an adenovirus, are identical except for a change in a protein (called IIIa) [21]. The adenovirus has an icosahedral structure. At each of the 12 vertices of the icosahedron there is a substructure called a penton, and the rest of the surface of the icosahedron consists of 240 hexons. Surface renderings and central cross-sections of the two versions of the adenovirus are shown in Fig. 9.12. Figures 9.13 and 9.14 show unsimplified and simplified component trees of 3D digital pictures of both versions of the adenovirus. Each simplified tree has 252 leaves, corresponding to the 12 pentons and 240 hexons. For the wildtype version, the critical node of lowest level in the simplified tree (Fig. 9.14a) is the parent of all 252 leaves. However, in the case of the mutant
Fig. 9.13: Unsimplified Component Trees of a Wildtype and b Mutant Adenoviruses. Because of the large numbers of nodes and edges in a and b, the structures of the viruses are not apparent from these unsimplified trees
version the lowest-level critical node in the simplified tree (Fig. 9.14b) is the parent of just 12 leaves, which correspond to pentons; it is the grandparent of the 240 leaves that correspond to hexons. Thus in the simplified component tree of the mutant virus there is a substantial range of threshold levels (such as level A in Fig. 9.14b) which separate the 12 penton leaves from each other and from the 240 hexon leaves, but which do not separate the hexon leaves from each other. In the simplified component tree of the wildtype virus there is no such range of threshold levels. To investigate whether this reflects a genuine difference between the two versions of the virus or merely a difference between the specific density maps from which we produced our component trees, the authors also created simplified component trees of other density maps produced by reconstruction from a randomly selected set of 2000 out of 3000 available projection images of each version of the virus. Ten simplified trees of the mutant virus and ten of the wildtype virus were created in this way. As reported in [8], in all of the mutant virus trees there was a substantial range of threshold values with the properties mentioned in the previous paragraph, but there was no such range of threshold levels in any of the wildtype virus trees. So one might conjecture that this is a way to distinguish simplified component trees of mutant adenoviruses from simplified component trees of wildtype adenoviruses. Whether this is in fact the case can only be determined by testing the conjecture on many other images of both versions of the virus.
Fig. 9.14: (λ , k)-Simplifications of Component Trees of Wildtype and Mutant Adenoviruses. Examples of trees produced by (λ , k)-simplifications of component trees of a our digital pictures of a wildtype adenovirus, and b our digital pictures of a mutant adenovirus. In a, the lowest-level critical node (represented by the horizontal line segment) is the parent of all 252 leaves of the tree. In b, the leaves below line B correspond to hexons and the leaves above line B correspond to pentons. Thus the lowest-level critical node in b (represented by the horizontal line segment above line A) is the parent of the 12 leaves which correspond to pentons, but is the grandparent of the 240 leaves which correspond to hexons. (Reproduced from [8])
9.3 Visualization Tools Using Component Trees As explained in Sec. 9.2, a component tree is a rooted tree which contains information regarding the connected components that are obtained when a digital picture is thresholded at different levels. More specifically, each node of a digital picture’s component tree is a connected component of the picture elements whose intensities are greater than or equal to the level of that node, and the ancestor-descendant relationships between nodes correspond to the inclusion relationships between these connected components. The results presented in [8] provide mathematical background for developing interactive visualization tools, based on component trees, that can be used to investigate biomolecular structures. We believe that such visualization tools will be of value to scientists and professionals who study and work with these structures. To illustrate how a tool of this kind can be used, we now describe three different scenarios in which regions in a density map can be identified by manual or automatic selection of nodes in a component tree. In these examples, a density map of the microtubule binding patterns of dimeric kinesins [11] (EMDB access code 1032) with 100 × 100 × 100 voxels and a voxel spacing of 5.68 Å is used. (This will be referred to as the EMDB-1032 density map from now on.) The structure of this macromolecule is composed of two different kinds of substructures: microtubules and dimer kinesins. The microtubules are long hollow cylinders made up of protofilaments (polymerised α- and β-tubulin dimers). The lateral association of 15 protofilaments generates the microtubule, a cylindrical structure with imperfect helical symmetry. Dimer kinesins are motor proteins that move along the microtubule. The kinesins are attached to the microtubule by a binding site, a region on a protein with which specific other molecules and ions form a chemical bond. Figure 9.2 shows a surface rendering and a slice of the microtubule binding patterns of dimeric kinesins. The first scenario presented here shows how connected components of a digital picture can be identified by a manual selection of nodes in a component tree. For this we use a (0, 20)-simplification of a component tree of the EMDB-1032 density map. Figure 9.15a shows such a (0, 20)-simplified component tree in which a subtree is magnified. For this illustration we selected six nodes in the component tree presented in Fig. 9.15a. The positions of these nodes in the magnified subtree are indicated by the colored arrows. Recall that each node is a connected component of the voxels whose intensities (in the EMDB-1032 density map) are greater than or equal to the level of that node. Take the case of the node indicated by the red arrow in Fig. 9.15a. This component is represented by the red surface rendering in Fig. 9.15b. It contains one of the 15 vertical sections of the microtubule and four dimer kinesins. The nodes indicated by the blue, green, yellow, and pink arrows are represented by the blue, green, yellow, and pink kinesins in Fig. 9.15c. Figure 9.15d shows the above-mentioned components in their positions within the density map. It is important to note that in Fig. 9.15 the red component contains the other four segmented components (blue, green, yellow and pink). This reflects the fact that, in the component tree, the node indicated by the red arrow is an ancestor of the nodes indicated by the blue, green, yellow, and pink arrows. The node indicated by the cyan arrow in Fig. 9.15a is another descendant of the node indicated by the red arrow.
This component lies in the vertical section of the microtubule to which the kinesins are attached in Fig. 9.15b; it is shown as the cyan segment in Fig. 9.15e. The second scenario illustrates how a component tree can be used to explore the components of a digital picture at various threshold levels. Recall that the level of any component tree node is defined to be the minimum of the intensity levels of the picture elements in that component. For any positive real number τ, we threshold a component tree at the level τ by omitting all the nodes of level less than τ. When we omit those nodes from the tree, we are left with a forest of subtrees. Our visualization tool can display surface renderings of those subtrees. (By a surface rendering of a subtree we mean a surface rendering of the node/component which is the root of that subtree, and which therefore contains all the other nodes of the subtree.) Figure 9.16 shows surface renderings of the subtrees obtained by thresholding a component tree at three different threshold levels τ1 < τ2 < τ3 . The component tree that is thresholded in this example is an (8, 50)-simplification of a component tree of the EMDB-1032 density map. Figure 9.16a shows that thresholding at level τ1 produces a forest of 17 subtrees. Fifteen of these subtrees have five or six leaves; these subtrees represent structures each of which comprises a vertical section of the microtubule and four or five attached kinesins. The other two subtrees have just one leaf each; these represent the two kinesins indicated by the cyan arrows in Fig. 9.16a. When the EMDB-1032
Fig. 9.15: Interactive Digital Picture Exploration. a A (0,20)-simplified component tree constructed from the EMDB-1032 density map. b Surface rendering of the component (node) indicated in a by the red arrow. This component contains one of the 15 vertical sections of the microtubule and four dimer kinesins. c The four kinesins (the components indicated in a by the blue, green, pink, and yellow arrows). d Positions in the density map of these five components. e Relationship between these five components and a sixth component that is indicated by the cyan arrow in a
density map is thresholded at level τ1 , those two kinesins are not connected (within this density map) to the rest of the macromolecule. In Figs. 9.16b and c, which show surface renderings of the subtrees obtained by thresholding at levels τ2 and τ3 , respectively, the red and blue rectangles highlight surface renderings of those parts of
Fig. 9.16: Digital Picture Exploration by Threshold Level. The (8,50)-simplified component tree of the EMDB-1032 density map is shown with three different threshold levels τ1 , τ2 , and τ3 in panels a, b, and c. The right side of each panel shows surface renderings of components that are the roots of the forest produced when the tree is thresholded at the level indicated in that panel
the red and blue subtrees that lie on or below the green line. In Fig. 9.16c, the purple circle highlights a component which corresponds to a tree node that is indicated by the purple arrow. As the threshold level is gradually lowered from the highest to the lowest level that occurs in the density map, the visualization tool will show surface renderings of components as they come into existence or merge with other components. For example, when the threshold level falls to the level of the closest common ancestor of the six leaves of the red subtree in Fig. 9.16b, the components that represent the five kinesins and the microtubule in the red rectangle in Fig. 9.16b will merge into a single component that represents one of the 15 vertical sections shown in Fig. 9.16a. Users can explore the component tree by interactively raising and lowering the threshold, and so gain a better understanding of the structural relationships between components.
Fig. 9.17: Automatic Digital Tree Exploration. a The part of the component tree (top left) that is highlighted by a rectangle. The 22 small red disks indicate the nodes in the component tree that are associated with components whose volume is greater than or equal to 420,000 Å³ and less than or equal to 450,000 Å³. b The various components associated with these nodes
The last scenario presented here shows how a visualization tool can automatically find all the tree nodes that represent components for which the value of a certain attribute of interest falls within a user-specified range, and then display those components. We will illustrate this using the tree of Fig. 9.17a, and using each component’s volume (which we define in the next paragraph) as its attribute of interest. Let I be a digital picture based on voxels, and let ω denote the volume of a single voxel (i.e., the cube of the voxel spacing); this is 193.10 Å³ for the EMDB-1032 density map we are using in this section. For any node c of a component tree of I, we define the volume of c as
\[ \mathrm{volume}(c) = \mathrm{card}(c)\,\omega \tag{9.1} \]
where card(c) is the cardinality of the component c (i.e., the number of voxels in c). For any component tree and any two volume sizes w and z such that w ≤ z,
our visualization tool can produce a surface rendering of all the nodes c such that w ≤ volume(c) ≤ z. Figure 9.17 shows an example of this for the density map used in the previous scenarios. For this example the component tree was simplified using k = 30 and λ = 2, and the parameters w and z were selected to be 420,000 Å³ and 450,000 Å³. The 22 nodes shown as small red disks in the magnified part of Fig. 9.17a are the nodes that have a volume between 420,000 Å³ and 450,000 Å³. Figure 9.17b shows a surface rendering of these 22 components. Note that only 16 components can be seen in Fig. 9.17b. This reflects the fact that the 22 nodes include six parent-child pairs: the six components that are the child nodes in these pairs are not visible because they are contained in the components that are their parents.
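The second and third scenarios both reduce to simple queries on the tree. The Python sketch below illustrates them under stated assumptions: the tree is held in hypothetical `parent`, `level`, and `card` dictionaries, levels do not decrease from parent to child, and all function names are invented for this illustration. It is not a description of the visualization tool itself.

```python
def threshold_forest_roots(parent, level, tau):
    """Roots of the forest left after omitting all nodes of level < tau."""
    return [n for n, p in parent.items()
            if level[n] >= tau and (p is None or level[p] < tau)]


def nodes_in_volume_range(card, omega, w, z):
    """Nodes c with w <= volume(c) <= z, where volume(c) = card(c) * omega."""
    return [n for n in card if w <= card[n] * omega <= z]


def visible_components(selected, parent):
    """Selected nodes that are not contained in another selected node.

    A selected node with a selected proper ancestor is hidden inside
    that ancestor when both are surface-rendered together.
    """
    chosen = set(selected)
    visible = []
    for n in selected:
        a = parent[n]
        while a is not None and a not in chosen:
            a = parent[a]
        if a is None:
            visible.append(n)
    return visible
```

Applied to the example of Fig. 9.17, a query like `nodes_in_volume_range` would correspond to the 22 selected nodes, and a filter like `visible_components` to the 16 components that can actually be seen.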
9.4 Potential Application of Component Trees to Macromolecular Docking 9.4.1 The Macromolecular Docking Problem Combining an atomic model of part of a macromolecular assembly with low-resolution imagery of that assembly as a whole gives a more detailed picture of the intact assembly [10, 29]. As Baker and Johnson wrote in the 1990s [1], this combination can yield a very useful pseudo-atomic precision model for the study of macromolecular assemblies and trigger new insights for structural biology. However, in order to create these pseudo-atomic models, a question needs to be answered: What is the position of the atomic model in the macromolecular assembly? It may be difficult to determine the correct position, especially as the atomic model and the macromolecular assembly images are usually produced independently by different techniques and with differing levels of detail. The process of combining the atomic model with the low-resolution image is known as docking or fitting the model into the low-resolution image (which is called the target). The potential usefulness of docking was demonstrated many years ago in the study of the structural biology of viruses [6] and muscles [19]. The work reported in [6] investigated particles of adenovirus type 2 and localized a minor component of their GON structure using a combination of electron microscopy and X-ray crystallography. Docking can be performed manually. In this approach a user (typically an expert in the field) interacts with a visualization tool to place the atomic structure into the low-resolution density map. This is a tedious and time-consuming process and is heavily dependent on the knowledge of the user. Despite the inherent subjectivity of manual docking, good results can be obtained (as reported in [24]), though the correctness of a manual docking might be contested by other professionals [15]. See [6, 19] for more on manual docking.
Docking has also been performed automatically. One approach assumes that the imaged objects are rigid bodies, and is accordingly known as rigid-body docking. References [20, 32] review methods of this kind which carry out a systematic search, over three translational and three rotational degrees of freedom, to find a position and orientation for the atomic model in the low-resolution map that optimizes some quality-of-fit measure. However, the underlying assumption of rigid-body docking that the imaged objects are rigid is very often inappropriate for complex macromolecules, because such a molecule may assume different conformations. It is usually necessary to change the conformation of the high-resolution structure to match the conformation observed in the low-resolution density map. Flexible docking methodologies allow for such conformation changes. Several different flexible docking strategies have been proposed. The main problem is to determine what deformation should be applied to the high-resolution density map to match the structural conformation in which the corresponding subunit occurs within the low-resolution density map. Various methods have been used to solve this problem. For example, [30] uses molecular dynamics simulation to compute the conformation changes, [25, 27] apply linear combinations of low-frequency normal modes to deform the atomic structure, and [12] implements Monte Carlo simulations to maximize cross-correlation coefficients and simulate the motion of the biomolecule as a collection of rigid clusters. In summary, the core task of a docking procedure is to find the spatial relationship between a low-resolution density map of a complete biological specimen and atomic models of one or more of its subunits. The large computation time, the variety of possible structural conformations of the subunits, and the lack of detail in low-resolution maps are some of the challenges that docking methodologies need to overcome.
9.4.2 A Tentative Docking Methodology Based on Component Trees As far as we are aware, none of the docking methods that have been reported in the literature makes use of component trees. Nevertheless, we believe that the possibility of using component trees to solve docking problems is worthy of investigation, and this is now a focus of our research. This subsection and the next will outline and illustrate an approach that seems quite promising to us. However, these subsections should be regarded as a snapshot of work in progress and not as a report on a completed project. The three main steps of the docking strategy we have in mind are as follows: In step 1, a low-resolution target density map and a PDB file (a file with the atomic coordinates of the atomic structure) are received as input. In step 2, a high-resolution density map for the atomic model is created from the PDB file. In step 3, the position and orientation of the high-resolution model density map in the low-resolution target
Fig. 9.18: A Three-Step Docking Strategy. The inputs (a low-resolution target density map and a PDB file) are received in step 1. Then, in step 2, a high-resolution density map for the atomic model is created using the atomic coordinates specified in the PDB file. Finally, in step 3, a position and orientation of the high-resolution model density map within the low-resolution target density map are found using an algorithm based on component trees
density map are found using an algorithm that is based on component trees. Figure 9.18 illustrates the basic pipeline for this docking strategy. Step 3 of our docking strategy can itself be understood as consisting of three substeps (i), (ii), and (iii): In substep (i) (of step 3), we construct simplified component trees of the target and the model density maps. In substep (ii) we find, in the target’s simplified component tree, a subtree that has approximately the same structure as the model’s simplified component tree. In substep (iii), we attempt to fit the model into the region of the target that is represented by the subtree found in substep (ii). This three-substep process for carrying out step 3 of Figure 9.18 is illustrated in Fig. 9.19. We believe that the concept of a tree embedding [9] may provide a basis for an implementation of substep (ii). A tree embedding is a mapping of the nodes of one rooted tree into the nodes of another rooted tree that preserves the ancestor-descendant relationships among nodes and does not introduce any new ancestor-descendant relationships. The concept is otherwise totally flexible (in the sense that its definition imposes no other constraints) regarding the image of each node. Using mathematical notation, we formally define a tree embedding of a rooted tree T* = (N*, E*) into a rooted tree T = (N, E) as a map ϕ : N* → N such that ϕ(n*) is a descendant of ϕ(m*) in T if, and only if, n* is a descendant of m* in T*. A tree embedding ϕ of T* into T is said to be a root-to-root tree embedding if ϕ(root(T*)) = root(T), where root(T*) and root(T) denote the roots of T* and T. Figure 9.20 shows an example of a root-to-root tree embedding of a tree T′_{v′_0} into a tree T. (Here T′_{v′_0} is the subtree of a tree T′ that is rooted at the node v′_0.) In addition, the caption of Fig. 9.20 describes three slightly different mappings and explains why two of the three are not tree embeddings. Every tree embedding is a one-to-one map (meaning that different nodes in N* map to different nodes of N). This is simply because every tree node is both an ancestor and a descendant of itself, and so a node in N cannot be the image of two
Fig. 9.19: A Three-Substep Process for Carrying Out Step 3 of Figure 9.18. In substep (i), the target density map (light gray surface rendering) and the model density map (orange surface rendering) are received from steps 1 and 2 of Figure 9.18, and simplified component trees of these two density maps (shown below the surface renderings) are constructed. Substep (ii) finds a subtree of the target’s simplified component tree that has approximately the same structure as the model’s simplified component tree. This subtree is shown within the peach-colored oval. In substep (iii), the model density map is fitted into the region of the target density map that is represented by the subtree found in substep (ii)
different nodes in N* as two different nodes cannot have the property that one is both an ancestor and a descendant of the other. A labeled tree is a triple (N, E, Ω), where (N, E) is a rooted tree and Ω assigns to every node n in N a real value Ω(n). Component trees as we have defined them in this chapter are labeled trees, with the label of a node being its level. But this is not a good labeling for docking purposes; the target and model density maps are likely to have been obtained by very different imaging methodologies and so the physical meaning of a node’s level is unlikely to be similar in the two cases. Instead, we want to use a labeling for which it is reasonable to measure the goodness of a tree embedding ϕ (for docking purposes) by the similarity between the labels assigned to the nodes in a candidate subtree of the target’s component tree and the labels assigned to their respective image nodes under ϕ in the model’s component tree. For this reason, we now introduce the concept of a labeled component tree of a digital picture: We will use this term to mean any labeled tree (N, E, Ω) such that N is the set of nodes and E the set of edges of that digital picture’s component tree. Note that the labeling map Ω of a labeled component tree (N, E, Ω) of a digital picture is not determined by that digital picture; Ω is chosen by us. One possibility would be to label each component tree node with its volume (as defined by (9.1)); that is the labeling used in the simple example we describe in the next subsection. It is easy to extend the concept of a tree embedding to labeled trees: Given labeled trees T = (N, E, Ω) and T′ = (N′, E′, Ω′), we say ϕ is a labeled tree embedding (respectively, a root-to-root labeled tree embedding) of T′ into T if ϕ is a tree embedding (respectively, a root-to-root tree embedding) of (N′, E′) into (N, E).
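The definition of a tree embedding can be tested directly for small trees. The Python sketch below checks the "if, and only if" condition on every pair of nodes; the `parent`-dictionary representation and the function names are assumptions made for this illustration. Following the convention used above, every node counts as a descendant of itself.

```python
def descendant_pairs(parent):
    """Set of pairs (n, m) such that n is a descendant of m.

    parent: node -> parent node (None for the root). Every node is
    treated as a descendant of itself.
    """
    pairs = set()
    for n in parent:
        a = n
        while a is not None:
            pairs.add((n, a))
            a = parent[a]
    return pairs


def is_tree_embedding(phi, parent_src, parent_dst):
    """True if phi(n) is a descendant of phi(m) exactly when n is one of m."""
    src = descendant_pairs(parent_src)
    dst = descendant_pairs(parent_dst)
    nodes = list(parent_src)
    return all(((phi[n], phi[m]) in dst) == ((n, m) in src)
               for n in nodes for m in nodes)


def is_root_to_root(phi, parent_src, parent_dst):
    """True if phi maps the root of the source to the root of the target."""
    root_src = next(n for n, p in parent_src.items() if p is None)
    root_dst = next(n for n, p in parent_dst.items() if p is None)
    return phi[root_src] == root_dst
```

Note that a map sending two different nodes to the same node fails this check, in agreement with the observation that every tree embedding is one-to-one.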
Fig. 9.20: Example of a Tree Embedding. The mapping shown by the red arrows is a root-to-root tree embedding of the tree T′_{v′_0} into the tree T (where T′_{v′_0} is the dark subtree of the tree T′). Note that if we alter this tree embedding so it maps v′_2 to v6 then it will no longer be a tree embedding, because the mapping will no longer satisfy the condition that the descendant v′_3 of v′_2 must be mapped to a descendant of the image of v′_2. If we alter the tree embedding so it maps v′_1 to v8 then the mapping will again no longer be a tree embedding, this time because the image of v′_1 will be a descendant of the image of v′_2 even though there is no ancestor-descendant relationship between v′_1 and v′_2. But if, instead, we alter the tree embedding so it maps v′_2 to v6 and maps v′_3 to v8, then it will still remain a root-to-root tree embedding of T′_{v′_0} into T
Now we will tentatively propose a way to carry out substep (ii) of our docking methodology. Let T^model be a labeled component tree of the model, and let T^target be a labeled component tree of the target. Then our tentative proposal is to find a labeled subtree of T^target for which there exists an optimal root-to-root labeled tree embedding of that labeled subtree into T^model, where “optimal” is currently taken to mean “preserves node labels as closely as possible” (a condition that will be stated precisely in the next paragraph). Here the term “labeled subtree” is defined as follows. Let T = (N, E, Ω) be any labeled tree and c be an element of N. Then the labeled subtree of T at c is the labeled tree T_c = (N_c, E_c, Ω_c), where
• N_c is the set of all descendants of c in T (which implies that c ∈ N_c);
• E_c is the set of those edges in E for which both the parent node and the child node are in N_c (see footnote 2);
• Ω_c is the restriction of Ω to N_c.
To formalize our current notion of an optimal root-to-root labeled tree embedding, we first define the component inconsistency of a labeled tree embedding. Given a labeled tree embedding ϕ of T′ = (N′, E′, Ω′) into T = (N, E, Ω), we define the component inconsistency of ϕ to be the following nonnegative real number:
\[ \sum_{d' \in N'} \left| \Omega'(d') - \Omega(\varphi(d')) \right| \tag{9.2} \]
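The component inconsistency is straightforward to compute once an embedding is given. The Python sketch below assumes that the summand in (9.2) is the absolute difference between the two labels (consistent with the value being nonnegative), that ϕ is stored as a dictionary whose keys are the nodes of N′, and that the labelings are dictionaries; these names and choices are illustrative assumptions.

```python
def component_inconsistency(phi, omega_src, omega_dst):
    """Sum over d' in N' of |Omega'(d') - Omega(phi(d'))|.

    phi:       d' -> phi(d'), defined on every node of the source tree
    omega_src: label map Omega' of the source tree
    omega_dst: label map Omega of the tree being embedded into
    """
    return sum(abs(omega_src[d] - omega_dst[phi[d]]) for d in phi)
```

For instance, if the labels are volumes as in (9.1), the result is the total volume mismatch between each node and its image.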
For any two labeled trees $T' = (N', E', \Omega')$ and $T = (N, E, \Omega)$, we write $\omega_{T',T}$ to denote the minimal value of the component inconsistency over all root-to-root tree embeddings of $T'$ into $T$, and we write $\varpi_{T',T}$ to denote the minimal value of $\omega_{T'_{c'},T}$ over all $c' \in N'$. A root-to-root embedding of a labeled subtree of $T'$ into $T$ will be considered to be optimal if the embedding’s component inconsistency is $\varpi_{T',T}$.

Using the notation we have just defined, our tentative proposal for carrying out substep (ii) of our docking methodology can be restated as follows: Let $T^{\mathrm{model}} = (N, E, \Omega)$ and $T^{\mathrm{target}} = (N', E', \Omega')$ be labeled component trees of the model and the target, respectively. Then the tentative proposal is to find a node $c' \in N'$ and a root-to-root labeled tree embedding $\varphi$ of $T^{\mathrm{target}}_{c'}$ into $T^{\mathrm{model}}$ for which the component inconsistency of $\varphi$ is $\varpi_{T^{\mathrm{target}},T^{\mathrm{model}}}$. We are hopeful that, if we use appropriate node labelings in $T^{\mathrm{model}}$ and $T^{\mathrm{target}}$, then such a node $c'$ and embedding $\varphi$ will often provide a good indication of where and how to dock the model into the target, and so allow us to carry out substep (iii).

As a concrete example of how this might be done, suppose our high-resolution model is a density map of a protein generated from the X-ray coordinates found in the PDB and our low-resolution target is a density map of a macromolecular complex which contains that protein. Let $T^{\mathrm{model}}$ and $T^{\mathrm{target}}$ be labeled component trees of these images, and suppose we have succeeded in finding a node $c'$ of $T^{\mathrm{target}}$ and an embedding $\varphi$ of $T^{\mathrm{target}}_{c'}$ into $T^{\mathrm{model}}$ that have the above-mentioned properties. We would then consider placements of the protein which put its center of mass near the center of mass of the region (in the target) that comprises all voxels of the component $c'$.
(We might also look for positions and orientations of the protein such that the centers of mass of the nodes $d'$ of the subtree $T^{\mathrm{target}}_{c'}$ are not far from the centers of mass of the nodes $\varphi(d')$ of the protein. But if more than one root-to-root embedding $\varphi$ of $T^{\mathrm{target}}_{c'}$ into $T^{\mathrm{model}}$ has a component inconsistency that is fairly close to $\varpi_{T^{\mathrm{target}},T^{\mathrm{model}}}$, then we should do this for all such embeddings $\varphi$ if we do it for one of them.)

This methodology depends on our being able to find an efficacious way to label the nodes of the component trees of the target and the model. In the example we discuss below we label each node with its volume (as defined by (9.1)), but this labeling may not give satisfactory results in other docking problems. It is also entirely possible that better results would be obtained if instead of the component inconsistency (9.2) we used another measure of the badness of a root-to-root labeled tree embedding. For instance, we might consider replacing $N'$ with $N' - \{\mathrm{root}(T')\}$ in (9.2). These are issues which our research must still address.
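The placement idea described above (putting the center of mass of the model near the center of mass of the region given by the chosen target node) can be sketched in a few lines of Python. This is only an illustration: the text does not say whether voxels are weighted by density, so the sketch assumes equal weights unless densities are supplied, and the voxel coordinates and the helper names `center_of_mass` and `translation_to_align` are hypothetical.

```python
def center_of_mass(voxels, spacing=1.0, densities=None):
    """Center of mass of a list of voxel index tuples, in physical units."""
    if densities is None:
        densities = [1.0] * len(voxels)  # assumption: every voxel has equal weight
    total = sum(densities)
    dims = len(voxels[0])
    return tuple(
        spacing * sum(w * v[axis] for v, w in zip(voxels, densities)) / total
        for axis in range(dims)
    )


def translation_to_align(model_com, target_com):
    """Shift that moves the model's center of mass onto that of the target region."""
    return tuple(t - m for m, t in zip(model_com, target_com))


# Hypothetical 2D data: voxels of the chosen target component and of the model.
target_region = [(10, 10), (12, 10), (12, 12), (10, 12)]
model_region = [(0, 0), (2, 0), (2, 2), (0, 2)]

target_com = center_of_mass(target_region, spacing=1.40)
model_com = center_of_mass(model_region, spacing=1.06)
shift = translation_to_align(model_com, target_com)
print(target_com, model_com, shift)
```

A translation of this kind fixes only the position of the model; choosing an orientation would need further information, such as the centers of mass of the embedded subtree's other nodes.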
9.4.3 A Simple Example

We now give an example of a problem for which the docking methodology we have tentatively proposed gives satisfactory results. Although our ultimate goal is to develop a method for docking three-dimensional models into three-dimensional targets, in the simple example we present here the model and the target density maps are two-dimensional. In fact they are derived from slices of two similar macromolecular images at different resolutions.

The first macromolecule, a native GroEL (EMDB access code 5001; the claimed resolution is 4.2 Å), is composed of 14 identical copies of the same chaperonin protein that are organized in two circular rings of seven proteins each. Its 200 × 200 × 200 density map has a voxel spacing of 1.06 Å. The second macromolecule is a GroEL + GroES in the ATP-bound state (EMDB access code 1180; the claimed resolution is 7.7 Å). This macromolecular complex is composed of 21 chaperonin proteins: In addition to a GroEL double-ring of 14 chaperonin proteins, this complex has an extra circular GroES ring that consists of 7 chaperonin proteins. Its 192 × 192 × 192 density map has a voxel spacing of 1.40 Å. Figs. 9.21a and d show surface renderings of these two density maps. In 9.21d the GroES ring is at the top; the other two rings are GroEL rings.

Fig. 9.21: The Model and Target Images Used to Illustrate Our Docking Methodology, and Their Component Trees. Surface renderings of GroEL at 4.2 Å resolution and GroEL + GroES at 7.7 Å resolution are shown in a and d, respectively. (In d, the top ring is the GroES ring; the middle and the bottom rings are GroEL rings.) b shows a slice from the middle of the bottom ring of the GroEL density map (left), and a cropped region of this slice (right); the green curve encloses the region that was cropped. We use the cropped region as our model image. e shows a slice from the middle of the lower GroEL ring of the GroEL + GroES density map; we use this slice as our target image. Component trees of the model image (the cropped region of the slice from the GroEL density map) and of the target (the GroEL slice from the GroEL + GroES density map) are shown in c and f, respectively. The surface renderings were produced using Chimera [18] and the slices were selected using XMIPP [23]

One slice was extracted from each of the two density maps, and the densities in these two slices were then quantized to a set of 20 equally spaced values. The slices that were extracted are from (approximately) the middle of the bottom GroEL ring in Figs. 9.21a and d.

The slice from GroEL + GroES at 7.7 Å resolution was used as the target image. This image is shown in Fig. 9.21e. To create the model image (i.e., the image that needs to be fitted into the target image) we manually selected, from the slice of GroEL at 4.2 Å resolution, a part that corresponds to one of the 7 proteins that the slice passes through. The slice of GroEL and the model image extracted from it are shown in Fig. 9.21b; the model image is the part of the slice that is enclosed by the green contour. As each of the two slices was extracted from the middle of a ring of GroEL chaperonin proteins, the labeled component tree of the target image should have parts that have approximately the same structure as the labeled component tree of the model image.

As discussed in the previous subsection, substep (i) of step 3 of our tentative docking methodology is to construct simplified component trees of the model and the target images. Unsimplified component trees of the model and the target images are shown in Figs. 9.21c and f, respectively. These trees were simplified using the methodology described in Subsection 9.2.5 to eliminate small components produced by noise or by the cropping process used to create the model image. The (1, 10)-simplification of the target component tree and the (2, 10)-simplification of the model component tree are shown in Figs. 9.22a and 9.22b, respectively.
Recall that, with the possible exception of the root, all the nodes of a simplified component tree are critical. So every node in Figs. 9.22a and b that is not a leaf and also is not the root of its tree is represented by a horizontal segment. Even though the simplified component trees in Figs. 9.22a and 9.22b have a much simpler structure than their unsimplified versions, they capture the essential structural information in the images. The tree shown in Fig. 9.22a, for example, has a subtree for each of the seven chaperonin proteins that appear in the target image. The roots of these subtrees are the nodes with exactly two or three leaf children. Compared with the unsimplified tree in Fig. 9.21f, the tree of Fig. 9.22a is simpler mainly due to the pruning away of small components (see Subsection 9.2.5); the choice k = 10 results in the removal of approximately 40% of the nodes.

To carry out substep (ii) of Figure 9.19, we first label every node in the simplified component trees with the volume of the component. For example, if a node in the component tree of the target image contains 20 voxels, then the label of that node will be 20 × 1.40 × 1.40 × 1.40 = 54.88 Å³ (as 1.40 Å is the voxel spacing for the target image). A node in the labeled component tree of the model image with the same number of voxels would have the label 20 × 1.06 × 1.06 × 1.06 = 23.82 Å³.
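The volume labeling just described amounts to multiplying each node's voxel count by the cube of the voxel spacing of its image. The short Python sketch below reproduces the two worked values; the function name `volume_label` and the dictionary of voxel counts are hypothetical and serve only as an illustration.

```python
def volume_label(voxel_count, voxel_spacing):
    """Volume (in cubic angstroms) of a component with the given number of voxels."""
    return voxel_count * voxel_spacing ** 3


TARGET_SPACING = 1.40  # angstroms per voxel, GroEL + GroES map
MODEL_SPACING = 1.06   # angstroms per voxel, GroEL map

print(round(volume_label(20, TARGET_SPACING), 2))  # 54.88
print(round(volume_label(20, MODEL_SPACING), 2))   # 23.82

# Labeling a whole tree: map every node's voxel count to its volume.
voxel_counts = {"root": 500, "a": 120, "b": 20}  # hypothetical counts
labels = {node: volume_label(n, TARGET_SPACING) for node, n in voxel_counts.items()}
```

Because the two images have different voxel spacings, equal voxel counts give different labels, which is why labels rather than raw counts are compared across the two trees.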
Fig. 9.22: Using Component Trees for Macromolecular Docking. a (1, 10)-simplified component tree constructed from the target image, and a subtree whose nodes constitute the domain of an optimal root-to-root tree embedding into the simplified component tree of the model image (see b). The nodes of the subtree are labeled with their volumes (measured in cubic angstroms). b (2, 10)-simplified component tree constructed from the model image; each node is labeled with its volume. c The target image. The green region in c is the region that is represented by the subtree shown in a: This is where our methodology suggests that the model image, shown in d, should be fitted
Let $T = (N, E, \Omega)$ and $T' = (N', E', \Omega')$ be the labeled component trees of the model and the target images, respectively. To complete substep (ii), we find a node $c' \in N'$ and a root-to-root labeled tree embedding $\varphi$ of $T'_{c'}$ into $T$ such that the component inconsistency of $\varphi$ is $\varpi_{T',T}$.
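How such a node and embedding might be found is not prescribed here. For very small trees one possibility is exhaustive search, sketched below in Python. The sketch makes several assumptions: a tree embedding is taken to be an injective map that preserves the ancestor-descendant relation in both directions, each summand of the component inconsistency is an absolute difference, a subtree with no root-to-root embedding is given infinite cost, and the trees and labels are hypothetical. Trying all injective maps takes factorial time, so this is an illustration and not a practical algorithm.

```python
from itertools import permutations
from math import inf


def is_descendant(parent, a, b):
    """True if a is a descendant of b (every node is a descendant of itself)."""
    while a is not None:
        if a == b:
            return True
        a = parent[a]
    return False


def best_root_to_root_embedding(parent_src, labels_src, c, parent_dst, labels_dst):
    """Minimal inconsistency, and a map attaining it, for the subtree at c."""
    dst_root = next(n for n in parent_dst if parent_dst[n] is None)
    others_src = [n for n in parent_src if n != c and is_descendant(parent_src, n, c)]
    others_dst = [n for n in parent_dst if n != dst_root]
    best = (inf, None)
    for images in permutations(others_dst, len(others_src)):
        phi = dict(zip(others_src, images))
        phi[c] = dst_root  # root-to-root condition
        nodes = list(phi)
        if all(is_descendant(parent_src, a, b) == is_descendant(parent_dst, phi[a], phi[b])
               for a in nodes for b in nodes):
            cost = sum(abs(labels_src[d] - labels_dst[phi[d]]) for d in nodes)
            if cost < best[0]:
                best = (cost, phi)
    return best


def optimal_docking_node(parent_src, labels_src, parent_dst, labels_dst):
    """Return (c, cost, phi) minimizing the inconsistency over all nodes c of the source."""
    results = {c: best_root_to_root_embedding(parent_src, labels_src, c, parent_dst, labels_dst)
               for c in parent_src}
    c_best = min(results, key=lambda c: results[c][0])
    return (c_best,) + results[c_best]


# Hypothetical target tree (two protein-like subtrees) and model tree, with volume labels.
parent_target = {"g0": None, "p1": "g0", "p2": "g0",
                 "l1": "p1", "l2": "p1", "l3": "p2", "l4": "p2"}
labels_target = {"g0": 700, "p1": 100, "p2": 60, "l1": 30, "l2": 20, "l3": 25, "l4": 5}
parent_model = {"m0": None, "m1": "m0", "m2": "m0"}
labels_model = {"m0": 98, "m1": 31, "m2": 18}

c_best, cost, phi = optimal_docking_node(parent_target, labels_target,
                                         parent_model, labels_model)
print(c_best, cost, phi)
```

In this toy input the subtree at `p1` matches the model with a total label difference of 5, while every other choice of node is either impossible to embed or much more costly.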
It turns out that the green node in Fig. 9.22a is a node $c' \in N'$ for which such an optimal embedding $\varphi$ exists. So the region of the target image that is given by the green node in Fig. 9.22a is a region where our tentative methodology suggests that the model be fitted. This is shown as the colored region in Fig. 9.22c. Now we can see that this region is in fact the position of one of the proteins in the GroEL + GroES slice. So, in this example, the docking suggestion given by our tentative methodology is correct.
9.5 Summary

This chapter has given an introduction to component trees and presented some applications of component trees of macromolecular images. Using simple one- and two-dimensional images, we have explained what a component tree is. We have also explained how such trees can be simplified (and why tree simplification may be desirable).

An example has been given of how simplified component trees may be able to distinguish images of similar macromolecular objects. In our example the objects are two different but similar versions of an adenovirus.

We believe that an interactive visualization tool which can display those regions of an image that are given by specified component tree nodes is of great value when studying the structure of a macromolecular image. We have given a few illustrations of how such a visualization tool may be used.

We have also presented some ideas for using component trees to solve macromolecular docking problems. We hope to develop and refine these preliminary ideas into a practical docking methodology.
Acknowledgment

The work presented here is currently supported by the National Science Foundation (award number DMS-1114901). We are grateful to José-María Carazo and Joachim Frank for their advice on this chapter based on careful reading of the originally submitted material.
References

1. Baker TS, Johnson JE (1996) Low resolution meets high: towards a resolution continuum from cells to atoms. Curr Opin Struc Biol 6(1):585–594
2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucl Acids Res 28(1):235–242
3. Carr H, Snoeyink J, Axen U (2003) Computing contour trees in all dimensions. Comput Geom 24(2):75–94
4. Chi Y, Muntz RR, Nijssen S, Kok JN (2005) Frequent subtree mining – an overview. Fundam Inf 66(1–2):161–198
5. Edelsbrunner H, Harer J (2010) Computational Topology: An Introduction. Am Math Soc, Providence
6. Furcinitti PS, van Oostrum J, Burnett RM (1989) Adenovirus polypeptide IX revealed as capsid cement by difference images from electron microscopy and crystallography. EMBO J 8(12):3563–3570
7. Herman GT (1998) Geometry of Digital Spaces. Birkhäuser, Boston
8. Herman GT, Kong TY, Oliveira LM (2012) Provably robust simplification of component trees of multidimensional images. In: Brimkov VE, Barneva RP (eds) Digital Geometry Algorithms. Lecture Notes in Computational Vision and Biomechanics, vol 2. Springer, Netherlands, pp 27–69
9. Herman GT, Kong TY, Oliveira LM (2012) Tree representation of digital picture embeddings. J Vis Commun Image Represent 23(6):883–891
10. Heymann JB (2001) Bsoft: image and molecular processing in electron microscopy. J Struct Biol 133(1–2):156–169
11. Hoenger A, Thormählen M, Diaz-Avalos R, Doerhoefer M, Goldie KN, Müller J, Mandelkow E (2000) A new look at the microtubule binding patterns of dimeric kinesins. J Mol Biol 297(5):1087–1103
12. Jolley CC, Wells SA, Fromme P, Thorpe MF (2008) Fitting low-resolution cryo-EM maps of proteins using constrained geometric simulations. Biophys J 94(5):1613–1621
13. Kiser PD, Lodowski DT, Palczewski K (2007) Purification, crystallization and structure determination of native GroEL from Escherichia coli lacking bound potassium ions. Acta Crystallogr Sect F: Struct Biol Cryst Commun 63(Pt 6):457–461
14. Lawson CL, Baker ML, Best C, et al (2011) EMDataBank.org: unified data resource for CryoEM. Nucl Acids Res 39(1):D456–D464
15. Ludtke SJ, Lawson CL, Kleywegt GJ, Chiu W (2010) The 2010 cryo-EM modeling challenge. Biopolymers 97(9):651–654
16. Najman L, Couprie M (2006) Building the component tree in quasi-linear time. IEEE Trans Image Process 15(12):3531–3539
17. Penczek PA, Yang C, Frank J, Spahn CMT (2006) Estimation of variance in single-particle reconstruction using the bootstrap technique. J Struct Biol 154(2):168–183
18. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF Chimera–a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
19. Rayment I, Holden HM, Whittaker M, Yohn CB, Lorenz M, Holmes KC, Milligan RA (1993) Structure of the actin-myosin complex and its implications for muscle contraction. Science 261(5117):58–65
20. Rossmann MG, Bernal R, Pletnev SV (2001) Combining electron microscopic with X-ray crystallographic structures. J Struct Biol 136(3):190–200
21. San Martin C, Glasgow JN, Borovjagin A, Beatty MS, Kashentseva EA, Curiel DT, Marabini R, Dmitriev IP (2008) Localization of the N-terminus of minor coat protein IIIa in the adenovirus capsid. J Mol Biol 383(4):923–934
22. Sarioz D, Kong TY, Herman GT (2006) History trees as descriptors of macromolecular structures. In: Bebis G et al (eds) Advances in Visual Computing. 2nd International Symposium Visual Computing, Lake Tahoe, November, 2006. Lecture Notes in Computer Science, vol 4291. Springer-Verlag, Heidelberg, pp 263–272
23. Scheres SHW, Nuñez-Ramirez R, Sorzano COS, Carazo JM, Marabini R (2008) Image processing for electron microscopy single-particle analysis using Xmipp. Nat Protoc 3(6):977–990
24. Siebert X, Navaza J (2009) Urox 2.0: an interactive tool for fitting atomic models into electron-microscopy reconstructions. Acta Crystallogr Sect D: Biol Crystallogr 65(Pt 7):651–658
25. Suhre K, Navaza J, Sanejouand H (2006) NORMA: a tool for flexible fitting of high-resolution protein structures into low-resolution electron-microscopy-derived density maps. Acta Crystallogr Sect D: Biol Crystallogr 62(Pt 9):1098–1100
26. Takahashi S, Takeshima Y, Fujishiro I (2004) Topological volume skeletonization and its application to transfer function design. Graph Models 66(1):24–49
27. Tama F, Miyashita O, Brooks CL (2004) Normal mode based flexible fitting of high-resolution structure into low-resolution experimental data from cryo-EM. J Struct Biol 147(3):315–326
28. Tarjan RE (1975) Efficiency of a good but not linear set union algorithm. J ACM 22(1):215–225
29. Topf M, Lasker K, Webb B, Wolfson H, Chiu W, Sali A (2008) Protein structure fitting and refinement guided by cryo-EM density. Structure 16(2):295–307
30. Trabuco LG, Villa E, Mitra K, Frank J, Schulten K (2008) Flexible fitting of atomic structures into electron microscopy maps using molecular dynamics. Structure 16(5):673–683
31. Van Heel M, Gowen B, Matadeen R, Orlova EV, Finn R, Pape T, Cohen D, Stark H, Schmidt R, Schatz M, Patwardhan A (2000) Single-particle electron cryo-microscopy: towards atomic resolution. Q Rev Biophys 33(4):307–369
32. Wriggers W, Birmanns S (2001) Using Situs for flexible and rigid-body fitting of multiresolution single-molecule data. J Struct Biol 133(2–3):193–202