Efficiently storing and extracting phylogenetic trees for simulation

Susan B. Davidson, Junhyong Kim and Yifeng Zheng
University of Pennsylvania

Abstract

Various phylogenetic tree generation algorithms have been proposed over the past decade. However, because they use different models of evolution and make different assumptions, they are difficult to compare. The Tree of Life project proposes to build a massive simulation tree to serve as the “true tree” against which these algorithms can be benchmarked. Because of the size of the simulation tree, and because most algorithms do not scale to that size, it has been proposed to store the simulation tree in a database and to extract smaller, meaningful subtrees as benchmarks for the algorithms. In this paper, we briefly describe the Tree of Life simulation project and summarize the operations required to generate meaningful simulation trees. After analyzing the characteristics of phylogenetic tree data and of the required operations, we propose a storage system whose schema is based on labeling. Furthermore, we show how to implement the operations by translating them into efficient SQL queries.

1 Introduction

Phylogenetic trees are used to identify and understand the evolutionary relationships between different species. Such a tree is an unlabeled tree in which the species studied occur as leaf nodes and the interior nodes represent evolutionary events. Phylogenetic trees are used in various fields, such as molecular biology, genetics and evolution, where inferences are drawn from the structure of the tree or from the way the character states map onto the tree. Biologists can then use these clues to build hypotheses and models of important events in history.

The number of phylogenetic studies is increasing rapidly, and in response a variety of phylogenetic tree generation algorithms [7, 4, 6, 11] have been proposed. Given the variety and complexity of these algorithms, it is natural to ask how to benchmark them. One approach is to use simulation. Simulation has been used to solve various complex problems in networking [12] and the social sciences [5], and has the advantages of low cost and easily controlled experimental conditions. For these reasons, the Tree-of-life (TOL) project [1] proposes to simulate the “true” phylogenetic tree in order to benchmark various tree generation algorithms.

To allow a variety of phylogenetic tree generation algorithms to be benchmarked, the TOL project uses various known evolutionary models as input to simulate different evolutionary assumptions. Scalability is another important feature, since the size of the true tree can be very small or very large depending on the application. Users must therefore be allowed to specify the size of the simulation tree.

The idea behind benchmarking a phylogenetic tree generation algorithm is to use the sequences attached to the leaves of the simulation tree as input for the algorithm, and from this to generate a result tree. The result tree can then be compared to the simulation tree to judge the performance of the algorithm.

One immediate question is: why store the simulation tree in a database rather than send the simulation algorithm to users and let them generate the simulation tree directly? One reason is that, for the benchmark to be fair and effective, it must cover many general cases. The simulation tree is therefore extremely large and takes a long time to generate. In addition, current phylogenetic tree generation algorithms can handle only a relatively small set of species (several hundred to several thousand) as input, and generate a small phylogenetic tree. We must therefore be able to generate a randomized small version of the simulation tree. To support this, we must be able to perform the following types of operations efficiently:

Random sampling of a set of species. Due to the size of the simulation tree and the fact that most algorithms do not scale to its size, we need a method to retrieve a subset of species. The most straightforward strategy is random sampling. Random sampling is an effective way to keep the benchmark fair, because every species is sampled with the same probability.

Random sampling of a set of species with respect to a given time. In some cases, however, biologists want all of the structure before a given time to be sampled, to guarantee that the sampled taxa are distributed over the whole phylogenetic tree. Plain random sampling is not enough to achieve this goal.
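As a point of reference for the time-constrained variant, plain random sampling (the first operation above) can be sketched as follows. This is only an illustration under stated assumptions, not the storage scheme proposed in this paper: it uses a simple edge table with columns (child, parent, weight) whose rows are taken from the example in Figure 3, an in-memory SQLite database stands in for the real database, and the names `build_edge_table`, `leaves` and `sample_species` are hypothetical.

```python
import random
import sqlite3

# Edge rows (child, parent, weight) taken from the Figure 3 example.
# The root has no parent and therefore no incoming edge weight.
EDGES = [
    ("Bha", "x", 2.5),
    ("Lla", "y", 1.0),
    ("Spy", "y", 1.0),
    ("y", "z", 0.5),
    ("Syn", "z", 1.5),
    ("z", "x", 0.75),
    ("Bsu", "x", 1.25),
    ("x", None, None),
]


def build_edge_table(rows):
    """Load the edge rows into an in-memory SQLite table named `edge`."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge (child TEXT PRIMARY KEY, parent TEXT, weight REAL)"
    )
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", rows)
    return conn


def leaves(conn):
    """Return the species (leaf nodes): nodes that never occur as a parent."""
    cur = conn.execute(
        "SELECT child FROM edge "
        "WHERE child NOT IN (SELECT parent FROM edge WHERE parent IS NOT NULL) "
        "ORDER BY child"
    )
    return [name for (name,) in cur]


def sample_species(conn, k, rng=random):
    """Draw k distinct leaves, every leaf being equally likely to be chosen."""
    return rng.sample(leaves(conn), k)


conn = build_edge_table(EDGES)
print(leaves(conn))
print(sample_species(conn, 3, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a sample reproducible, which may be convenient when the same benchmark subset has to be handed to several algorithms. Because the draw ignores the tree structure entirely, nothing guarantees that the sampled species are spread over all the lineages present at a given time.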
In this case, we need to select all nodes just below this point in the phylogenetic tree, and then randomly sample the required number of leaf nodes from the subtrees rooted at the selected nodes. As an example, suppose a user wants to randomly sample 4 species with respect to weight 1 from the simulation tree in Figure 1, where a node represents a species, the (optional) tag of a node is the name of the species, an edge between two nodes represents the parent-child relationship between two species, and the label of an edge is its weight (typically, the evolution time from the parent species to the child species). We first search for all nodes (including leaves) whose total weight from the root is larger than 1 and that have no other node with this property on the path from the root to them. Four nodes {Bha, x, Syn, Bsu} satisfy the condition, where x is the parent node of Lla and Spy. Then, for each of these nodes, we randomly select 4/4 = 1 leaf from the subtree rooted at that node. The result will be {Bha, Lla, Syn, Bsu} or {Bha, Spy, Syn, Bsu}.

Figure 1: Sample phylogenetic tree

Figure 2: The induced subtree for leaves {...}

child   parent  weight
Bha     x       2.5
Lla     y       1
Spy     y       1
y       z       0.5
Syn     z       1.5
z       x       0.75
Bsu     x       1.25
x       NULL

Figure 3: Store phylogenetic tree in edge table

Reconstructing the tree structure induced by a subset of leaf nodes in the simulation tree. To compare a result tree against the simulation tree, we must be able to generate the induced tree over the sampled species used in the phylogenetic tree generation algorithm. For example, suppose we have the simulation tree in 9;:=