School of IT Technical Report

AN INTEGER LINEAR PROGRAMMING FORMULATION OF THE COPHYLOGENY RECONSTRUCTION PROBLEM

TECHNICAL REPORT 629

RAN LIBESKIND-HADAS

DEPARTMENT OF COMPUTER SCIENCE HARVEY MUDD COLLEGE, USA

MICHAEL A. CHARLESTON

SCHOOL OF INFORMATION TECHNOLOGIES THE UNIVERSITY OF SYDNEY

AUGUST 2008

An Integer Linear Programming Formulation of the Cophylogeny Reconstruction Problem

Ran Libeskind-Hadas∗

Michael A. Charleston†

Abstract

The cophylogeny reconstruction problem is that of finding minimal cost explanations of differences between evolutionary histories of ecologically linked groups of biological organisms. A general form of this problem is known to be NP-complete [2] and the problem is conjectured to remain intractable under a variety of assumptions. Therefore, heuristics and optimized search algorithms are needed for this problem. This report shows how the problem can be formulated as an integer linear programming (ILP) problem. Using this formulation, ILP solvers may be used to solve the problem or, alternatively, relaxation methods may be used to find approximate solutions to the problem.

Keywords: Coevolution, cophylogeny, integer linear programming

1 Introduction

This report describes the results of our study into the use of 0-1 integer linear programming (ZOILP) to solve the cophylogeny problem. The input consists of a host tree $H$, a parasite tree $P$, a mapping from the leaves of $P$ into (and usually onto) the leaves of $H$, and a cost for each of four permitted operations: cospeciation, duplication, host switching, and loss. The objective is to determine the minimum cost reconstruction of host/parasite associations for the given inputs. Background and previous work on this problem can be found in [1, 3, 4, 5, 6]. In addition, results on the computational complexity of this problem can be found in [2]. In Section 2 we describe a 0-1 integer linear programming formulation of this problem. In Section 3 we describe our implementation and discuss its utility.

∗ Department of Computer Science, Harvey Mudd College, USA. This work was conducted at, and partially supported by, the School of Information Technology at the University of Sydney, Australia. Additional support was provided by the Mellon Foundation.

† School of Information Technology, Sydney Bioinformatics, and Centre for Mathematical Biology, University of Sydney, Australia.

2 ILP Formulation of the Cophylogeny Problem

Assume that we are given two putative phylogenetic trees: a host tree $H$ and a parasite tree $P$. Let $L(T)$ be the set of leaves of a tree $T$, let $\ell(T)$ denote the number of leaves of $T$, and let $N(T)$ denote the total number of nodes of $T$. Thus, the number of internal nodes of $T$ is $\ell(T) - 1$. We assume that $\ell(H) \ge \ell(P)$. Let $\varphi : L(P) \to L(H)$ denote a mapping from leaves of $P$ to leaves of $H$.

We modify the host phylogenetic tree by creating a “new root” vertex with a single edge to the root of the original tree. This is required in order to account for events in the parasite phylogeny that predate the most recent common ancestor in the host phylogeny.

The trees are now interpreted as follows. The timespan of the tree begins with the root and ends at the leaves, which are assumed to occur concurrently at “current time”. An edge from parent $u$ to child $v$ represents the lifetime of species $v$. If $v$ is an internal node, it speciates at the end of its lifetime. The lifetime of $v$ is assumed to be a half-open time interval: open at the beginning and closed at the end.

Each internal node in a tree corresponds to an event. The new root node of outdegree 1 at the top of a tree corresponds to the beginning of the root node’s lifetime. Each other internal node corresponds to a speciation event. All the leaves of both trees correspond to “current time”. Thus, in order to allow for every possible relative ordering of events, we introduce $\ell(H) + \ell(P) - 1$ “relative” times: one for each of the $\ell(H) - 1$ speciation events in $H$, one for each of the $\ell(P) - 1$ speciation events in $P$, and one additional current time for the leaves of both trees. These relative times allow us to represent the relative ordering of speciation events.

We now construct a Boolean satisfiability problem comprising a collection of Boolean variables and expressions over those variables, such that a satisfying valuation of the variables corresponds to a feasible solution to the cophylogeny problem (i.e., a valid collection of cospeciation, duplication, host-switching, and loss events that is consistent with the given trees $H$ and $P$ and the mapping $\varphi$). We then convert the Boolean expressions into conjunctive normal form (CNF) using the standard algorithm, and convert the CNF expressions into a 0-1 integer linear program (ZOILP). The objective function of the ZOILP will correspond to the cost of the feasible solution.
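The last conversion step, from CNF to 0-1 linear constraints, can be illustrated with a small sketch. The report does not spell out the encoding, so the code assumes the textbook one: a clause holds iff the sum of its literals is at least 1, where a negated literal ¬x contributes (1 − x). The names `clause_to_inequality` and `holds` are hypothetical and chosen for illustration only.

```python
from itertools import product
from typing import Dict, List, Tuple

# A literal is (variable_name, is_positive); a clause is a disjunction of literals.
Literal = Tuple[str, bool]


def clause_to_inequality(clause: List[Literal]) -> Tuple[Dict[str, int], int]:
    """Encode a clause as sum(coeffs[v] * v) >= rhs over 0-1 variables.

    Assumed textbook encoding: each positive literal x contributes x and each
    negated literal contributes (1 - x); the clause holds iff the total is >= 1.
    The constant 1 of every negated literal is moved to the right-hand side.
    """
    coeffs: Dict[str, int] = {}
    rhs = 1
    for name, positive in clause:
        if positive:
            coeffs[name] = coeffs.get(name, 0) + 1
        else:
            coeffs[name] = coeffs.get(name, 0) - 1
            rhs -= 1
    return coeffs, rhs


def holds(coeffs: Dict[str, int], rhs: int, assignment: Dict[str, int]) -> bool:
    """Evaluate the linear inequality under a 0-1 assignment."""
    return sum(c * assignment[v] for v, c in coeffs.items()) >= rhs


# Example: (a OR NOT b OR c) becomes a - b + c >= 0.
clause = [("a", True), ("b", False), ("c", True)]
coeffs, rhs = clause_to_inequality(clause)

# Brute-force check that the inequality and the clause agree on every assignment.
for a, b, c in product((0, 1), repeat=3):
    assert holds(coeffs, rhs, {"a": a, "b": b, "c": c}) == bool(a or not b or c)
```

Under this encoding every clause yields exactly one inequality, so the size of the resulting program is governed by the number of CNF clauses.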

The following are conventions and assumptions made throughout the rest of this document:

1. Let $\tau = \ell(H) + \ell(P) - 1$ denote the number of possible relative times.
2. We assume that a parasite can be on only one host at a given time.
3. We index the nodes in a tree using consecutive integers beginning with 0, where the root node is numbered 0.
4. For clarity and consistency, we use $i$, $j$, and $t$ as indices for parasites, hosts, and time, respectively.
5. For convenience, we use $i \in L(P)$ and $j \in L(H)$ to indicate that $i$ and $j$ are the indices of leaves of $P$ and $H$, respectively. Strictly speaking, $L(T)$ is the set of leaves of $T$ rather than indices of leaves, but we use this shorthand to simplify our exposition.
6. When using an index $x$ to represent a node in a tree, $x'$ represents the parent. Moreover, if $x$ is an internal node, $x_1$ and $x_2$ represent the two children of $x$.

We first introduce variables and constraints corresponding to a valid host tree.

Variable 1 (Host Activity) Variable $h_{j,t}$ is true iff host $j$ is “active” at time $t$, $0 \le j < N(H)$, $0 \le t \le \tau$.

Constraint 1 (Root) The root of a tree can begin at any time, but the lifetime must be in a contiguous block of time. Thus, if the root is active at a given time $t$ then it was either active at time $t-1$ or it became active at time $t$ and was never active prior. This is expressed by:

$$h_{0,t} \rightarrow h_{0,t-1} \vee \bigwedge_{u=0}^{t-1} \neg h_{0,u}$$

for $1 \le t \le \tau$.

Constraint 2 (Leaves) All leaves are active at time $\tau$. This is expressed by $h_{j,\tau}$ for all $j \in L(H)$.
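Constraint 1 reads "if the root is active at time t, then it was active at time t − 1 or it was inactive at every time u < t". Because its right-hand side contains a conjunction, it is not yet a set of clauses. The sketch below shows one way the distribution step could go; it is not taken from the report's implementation, and the function names and the `(node, time)` variable naming are assumptions made for illustration. It checks by brute force that the resulting clauses agree with the implication.

```python
from itertools import product
from typing import List, Tuple

# A variable is identified by (node, time); a literal pairs it with a sign.
Var = Tuple[int, int]
Literal = Tuple[Var, bool]


def root_clauses(tau: int) -> List[List[Literal]]:
    """CNF clauses for h[0,t] -> h[0,t-1] OR AND_{u<t} NOT h[0,u], 1 <= t <= tau.

    Distributing OR over the conjunction yields, for each u < t, the clause
    (NOT h[0,t] OR h[0,t-1] OR NOT h[0,u]).  The clause for u = t-1 contains
    both h[0,t-1] and its negation, so it is a tautology and is skipped.
    """
    clauses = []
    for t in range(1, tau + 1):
        for u in range(t - 1):
            clauses.append([((0, t), False), ((0, t - 1), True), ((0, u), False)])
    return clauses


def satisfied(clauses: List[List[Literal]], h: Tuple[int, ...]) -> bool:
    """h[t] is the 0-1 activity of the root at time t."""
    return all(any(bool(h[t]) == sign for (_, t), sign in clause) for clause in clauses)


def constraint_1(h: Tuple[int, ...]) -> bool:
    """Direct evaluation of the implication for every t >= 1."""
    return all(
        (not h[t]) or h[t - 1] or not any(h[u] for u in range(t))
        for t in range(1, len(h))
    )


tau = 4
cnf = root_clauses(tau)
# The clause set and the original implication agree on all 2^(tau+1) assignments.
for h in product((0, 1), repeat=tau + 1):
    assert satisfied(cnf, h) == constraint_1(h)
```

Constraint 2 needs no conversion: each $h_{j,\tau}$ with $j \in L(H)$ is already a unit clause.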

Constraint 3 (Contiguity) To ensure that each host, other than the root, becomes active at exactly the time that its parent expires and that the host remains active for a contiguous sequence of time, the following constraints are introduced, where $j'$ represents the parent of host $j$:

$$h_{j,t} \rightarrow h_{j,t-1} \vee (h_{j',t-1} \wedge \neg h_{j',t})$$

$$\neg h_{j,0}$$

for $1 \le j < N(H)$, $1 \le t \le \tau$. The first constraint states that if host $j$ is active at time $t$ then either it was active at time $t-1$ or its parent was active at time $t-1$ but was no longer active at time $t$. The second constraint states that no non-root host was active at time 0. This prohibits a non-root host from occurring without a parent.

Similarly, we construct analogous variables and constraints for the parasite tree:

Variable 2 (Parasite Activity) Variable $p_{i,t}$ is true iff parasite $i$ is “active” at time $t$, $0 \le i < N(P)$, $0 \le t \le \tau$.

Constraint 4 (Root) The root of a tree can begin at any time, but its lifetime must be in a contiguous block of time. Thus, if the root is active at a given time $t$ then it was either active at time $t-1$ or it became active at time $t$ and was never active prior.

$$p_{0,t} \rightarrow p_{0,t-1} \vee \bigwedge_{u=0}^{t-1} \neg p_{0,u}$$

for $1 \le t \le \tau$.

Constraint 5 (Leaves) All leaves are active at time $\tau$. This is expressed as $p_{i,\tau}$ for all $i \in L(P)$.

Constraint 6 (Contiguity)

$$p_{i,t} \rightarrow p_{i,t-1} \vee (p_{i',t-1} \wedge \neg p_{i',t})$$

$$\neg p_{i,0}$$

for $1 \le i < N(P)$, $1 \le t \le \tau$.

Next, we introduce variables and constraints that represent the alignment of parasites onto hosts.
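Before turning to the alignment variables, it may help to see the timing constraints applied to a concrete assignment. Constraints 1 to 3 (hosts) and 4 to 6 (parasites) have the same shape, so a single checker can serve both trees. The sketch below is illustrative only: the parent-array representation, the function name `timing_ok`, and the toy tree with τ = 2 are assumptions, not part of the report.

```python
from typing import Optional, Sequence


def timing_ok(parent: Sequence[Optional[int]], act: Sequence[Sequence[int]]) -> bool:
    """Check the root, leaf, and contiguity constraints for one tree.

    parent[x] is the parent index of node x (None for node 0, the root);
    act[x][t] is 1 iff node x is active at relative time t, 0 <= t <= tau.
    """
    n = len(parent)
    tau = len(act[0]) - 1
    has_child = [False] * n
    for x in range(1, n):
        has_child[parent[x]] = True

    for t in range(1, tau + 1):
        # Root: active at t implies active at t-1 or never active before t.
        if act[0][t] and not (act[0][t - 1] or not any(act[0][:t])):
            return False
        # Contiguity: a non-root node active at t was active at t-1, or its
        # parent was active at t-1 and is no longer active at t.
        for x in range(1, n):
            p = parent[x]
            if act[x][t] and not (act[x][t - 1] or (act[p][t - 1] and not act[p][t])):
                return False

    # No non-root node is active at time 0; every leaf is active at time tau.
    if any(act[x][0] for x in range(1, n)):
        return False
    return all(act[x][tau] for x in range(n) if not has_child[x])


# Toy tree (hypothetical): root 0 with leaf children 1 and 2, and tau = 2.
parent = [None, 0, 0]
good = [[1, 1, 0], [0, 0, 1], [0, 0, 1]]   # children start when the root expires
bad = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]    # node 1 starts while the root is active
assert timing_ok(parent, good)
assert not timing_ok(parent, bad)
```

In the `bad` assignment, node 1 is active at time 1 although it was inactive at time 0 and its parent is still active at time 1, which violates the contiguity implication.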

Variable 3 $\pi_{i,j,t}$ is true iff parasite $i$ is on host $j$ at time $t$, for $0 \le i < N(P)$, $0 \le j < N(H)$, $0 \le t \le \tau$.

Constraint 7 (Final state) At the last relative time, $\tau$, the mapping of parasites to hosts is defined by the given mapping $\varphi$. This is expressed as $\pi_{i,j,\tau}$ for $i \in L(P)$, $j \in L(H)$, and $\varphi(i) = j$.

Constraint 8 (Each active parasite is on exactly one host at a time) If a parasite is active at a given time then it must be on exactly one host at that time. This is expressed in two parts. First, if parasite $i$ is active at time $t$ then parasite $i$ is on at least one host $j$ at time $t$:

$$p_{i,t} \rightarrow \bigvee_{0 \le j < N(H)} \pi_{i,j,t}$$
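As an illustration of how this first part could enter the ZOILP, assume the usual encoding in which a clause holds iff the sum of its literals is at least 1 (the report does not fix an encoding at this point), and take the index range $0 \le j < N(H)$ from the definition of Variable 3. The implication is already a single clause, $\neg p_{i,t} \vee \bigvee_j \pi_{i,j,t}$, so it becomes one linear inequality:

$$(1 - p_{i,t}) + \sum_{0 \le j < N(H)} \pi_{i,j,t} \ge 1 \quad\Longleftrightarrow\quad \sum_{0 \le j < N(H)} \pi_{i,j,t} \ge p_{i,t}.$$

When $p_{i,t} = 0$ the inequality is vacuous, and when $p_{i,t} = 1$ it forces at least one $\pi_{i,j,t}$ to be 1.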
