The $-Calculus Process Algebra for Problem Solving and Its Support for Bioinformatics

Lara Atallah and Eugene Eberbach

Computer and Information Science Dept., University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747-2300
[email protected], [email protected]

Abstract. In this paper a new technique for solving hard computational problems in bioinformatics is investigated: the $-calculus, a process algebra for problem solving that applies cost performance measures to converge to optimal solutions with minimal problem-solving costs. We demonstrate that the generic $-calculus search method, called the kΩ-optimization, can be used to solve gene finding and sequence alignment problems. The solutions can be either precise or approximate, obtained by applying $-calculus optimization or total optimization.

1 Introduction

Bioinformatics (computational molecular biology) is a relatively new discipline, bringing together computational, statistical, experimental, and technological methods, which is energizing and dramatically accelerating the discovery of new technologies and tools for molecular biology [9]. It consists of several subareas, including gene finding, microarray analysis, sequence alignment, protein folding, and molecular docking [9,11,15]. The solutions of bioinformatics problems very often require searching through very large search spaces. Various techniques from computer science have been utilized to solve bioinformatics problems, including dynamic programming, Hidden Markov Models, randomized algorithms, combinatorial pattern matching, divide-and-conquer algorithms, graph algorithms, and clustering and tree algorithms. Very often even the fastest supercomputers, like the IBM Blue Gene, are not good enough to solve bioinformatics problems effectively because of the enormous search spaces. Thus, besides well-recognized computer science heuristic and algorithmic methods, new techniques that deal with the intractability of search are needed. In this paper, we present one such new technique, based on process algebras [10] and anytime algorithms [8]: the $-calculus of bounded rational agents [3,4,7,16], designed to provide support for solutions of intractable and undecidable problems. The $-calculus process algebra for problem solving applies cost performance measures to converge to optimal solutions with minimal problem-solving costs. Because of these features, we hypothesize that the $-calculus should be useful in solving hard computational problems in bioinformatics.

In this paper, we investigate a new application of the $-calculus to selected problems in bioinformatics, in particular to gene finding and sequence alignment. The paper is organized as follows. In section 2, a brief primer on the $-calculus is presented. In section 3, the application of the $-calculus to gene finding is investigated through the solution of the exon chaining problem. In section 4, sequence alignment is solved as a special case of the very generic $-calculus search method, called the kΩ-optimization; this includes both global and local sequence alignment, solved in the same uniform way. Section 5 contains conclusions and problems to be solved in the future.

2 The $-Calculus Process Algebra for Problem Solving under Bounded Resources

The $-calculus is a mathematical model of processes capturing both the final outcome of problem solving and the interactive, incremental way in which problems are solved. It is a process algebra of bounded rational agents for interactive problem solving, targeting intractable and undecidable problems; it was introduced in the late 1990s [3,4,7,16]. The $-calculus (pronounced "cost calculus") is a formalization of resource-bounded computation (also called anytime algorithms), proposed by Dean, Horvitz, Zilberstein and Russell in the late 1980s and early 1990s [8,13]. Anytime algorithms are guaranteed to produce better results if more resources (e.g., time, memory) become available. The standard representative of process algebras, the π-calculus [10], is believed to be the most mature approach for concurrent systems. The $-calculus rests upon the primitive notion of cost in a similar way as the π-calculus was built around the central concept of interaction. The cost and interaction concepts are interrelated in the sense that cost captures the quality of an agent's interaction with its environment. The unique feature of the $-calculus is that it provides support for problem solving by incrementally searching for solutions and using cost to direct its search. The basic $-calculus search method used for problem solving is called the kΩ-optimization. The kΩ-optimization represents an "impossible to construct, but possible to approximate indefinitely" universal algorithm. It is a very general search method, allowing the simulation of many other search algorithms, including A*, minimax, dynamic programming, tabu search, and evolutionary algorithms [6]. Each agent has its own search space Ω and its own limited horizon of deliberation, with depth k and width b.
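The anytime principle that the $-calculus builds on can be illustrated with a small sketch (the function and parameter names here are ours, not the paper's): an interruptible optimizer that always keeps the best solution found so far, so granting it more time can only yield an equal or better answer.

```python
import time

def anytime_minimize(f, candidates, budget=0.05):
    """Toy anytime optimizer: scan candidate solutions, always retaining
    the best one seen so far, and stop when the time budget runs out."""
    best, best_cost = None, float('inf')
    deadline = time.monotonic() + budget
    for c in candidates:
        if time.monotonic() > deadline:
            break                      # resources exhausted: return best so far
        if f(c) < best_cost:
            best, best_cost = c, f(c)
    return best, best_cost

# With a generous budget the exact minimum is found; with a tiny budget
# a (possibly worse) best-so-far answer is still returned.
result = anytime_minimize(lambda x: (x - 3) ** 2, range(10), budget=1.0)
```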
Agents can cooperate by selecting actions with minimal costs, can compete if some of them minimize and some maximize costs, or can be impartial (irrational or probabilistic) if, from the point of view of the observer, they do not attempt to optimize (evolve, learn). The $-calculus can be understood as another step in the never-ending dream of universal problem solving methods recurring throughout the history of computer science. The $-calculus is applicable to robotics [5], software agents, neural nets, and evolutionary computation [2]. Potentially it could be used for the design of cost languages,

cellular evolvable cost-driven hardware, DNA-based computing [2] and bioinformatics, electronic commerce, and quantum computing. The $-calculus leads to a new programming paradigm, cost languages [6], and a new class of computer architectures, cost-driven computers.

2.1. The $-Calculus Syntax

In the $-calculus everything is a cost expression: agents, an environment, communication, interaction links, inference engines, modified structures, data, code, and metacode. $-expressions can be simple or composite. Simple $-expressions α are considered to be executed in one atomic indivisible step. Composite $-expressions P consist of distinguished components (simple or composite ones) and can be interrupted. The $-calculus process expressions consist of simple $-expressions α and composite $-expressions P, and are defined by the following syntax:

α ::= ($ i∈I Pi)      compute cost of Pi
    | (→ i∈I c Pi)    send Pi with evaluation through channel c
    | (← i∈I c Xi)    receive Xi from channel c
    | (' i∈I Pi)      suppress evaluation of Pi
    | (a i∈I Pi)      defined call of simple $-expression a with parameters Pi
    | (¬a i∈I Pi)     negation of defined call of simple $-expression a

P ::= (° i∈I α Pi)    sequential composition
    | (|| i∈I Pi)     parallel composition
    | (⊔ i∈I Pi)      cost choice, select Pi with the smallest cost
    | (⊕ i∈I Pi)      adversary choice, select Pi with the largest cost
    | ([] i∈I Pi)     general choice, select Pi randomly or based on a condition
    | (f i∈I Pi)      defined process call f with parameters Pi, and its associated recursive definition (:= (f i∈I Xi) R) with body R

The indexing set I is possibly countably infinite. In the case when I is empty, we write the empty parallel composition and the empty general, cost and adversary choices as ⊥ (blocking), and the empty sequential composition (I empty and α=ε) as ε (an invisible, transparent action, used to mask, i.e., make invisible, parts of $-expressions). Adaptation (evolution/upgrade) is an essential part of the $-calculus, and all $-calculus operators are infinite (the indexing set I is unbounded). The $-calculus agents interact through the send-receive pair as the essential primitives of the model. Sequential composition is used when $-expressions are evaluated in a textual order. Parallel composition is used when expressions run in parallel; it picks a subset of non-blocked elements at random. Cost choice is used to select the cheapest

alternative according to a cost metric. Adversary choice is used to select the most expensive alternative according to a cost metric. General choice picks one non-blocked element at random; it differs from cost and adversary choices in that it uses guard satisfiability, whereas cost and adversary choices are based on cost functions. Call and definition encapsulate expressions in a more complex form (like procedure or function definitions in programming languages); in particular, they specify recursive or iterative repetition of $-expressions. Simple cost expressions execute in one atomic step. Cost functions are used for optimization and adaptation; the user is free to define his/her own cost metrics. Send and receive perform handshaking message-passing communication and inferencing. The suppression operator suppresses evaluation of the underlying $-expressions. Additionally, a user is free to define her/his own simple $-expressions, which may or may not be negated.
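The cost semantics of the choice and composition operators can be illustrated with a toy evaluator. The encoding of $-expressions as nested Python tuples and the operator names are our own hypothetical convention; the cost rules follow the standard cost function used later in the paper (cost choice evaluates to the minimum of its alternatives, adversary choice to the maximum, sequential composition to the sum).

```python
import random

def cost(expr):
    """Evaluate the cost of a $-expression encoded as a nested tuple,
    e.g. ('seq', ('atom', 2), ('cost_choice', ('atom', 5), ('atom', 1)))."""
    op, *args = expr
    if op == 'atom':                 # simple $-expression with a known cost
        return args[0]
    if op == 'seq':                  # (° ...) sum the components' costs
        return sum(cost(a) for a in args)
    if op == 'cost_choice':          # cost choice: cheapest alternative
        return min(cost(a) for a in args)
    if op == 'adv_choice':           # adversary choice: most expensive one
        return max(cost(a) for a in args)
    if op == 'gen_choice':           # general choice: random alternative
        return cost(random.choice(args))
    raise ValueError(f"unknown operator: {op}")
```

For example, a sequential composition of an atom of cost 2 with a cost choice between costs 5 and 1 evaluates to 2 + min(5, 1) = 3.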

2.2. The $-Calculus Semantics: The kΩ-Optimization Search

In this section we define the operational semantics of the $-calculus using the kΩ-optimization search, which captures the dynamic nature and incomplete knowledge associated with the construction of the problem solving tree. The performance of search algorithms can be evaluated in four ways [13], capturing whether a solution has been found, its quality, and the amount of resources used to find it. We say (see, e.g., [13]) that a search algorithm is
• Complete if it guarantees reaching a terminal state/solution if there is one.
• Optimal if the solution is found with the optimal value of its objective function.
• Search optimal if the solution is found with the minimal amount of resources used (e.g., time and space complexity).
• Totally optimal if the solution is found both with the optimal value of its objective function and with the minimal amount of resources used.
The basic $-calculus problem solving method, the kΩ-optimization, is a very general search method providing meta-control and allowing the simulation of many other search algorithms, including A*, minimax, dynamic programming, tabu search, and evolutionary algorithms [13]. Problem solving works iteratively, through select, examine and execute phases:
• In the select phase the tree of possible solutions is generated up to k steps ahead with branching factor b, and an agent identifies its alphabet of interest for optimization, Ω. This means that the tree of solutions may be incomplete in width and depth (to deal with complexity). However, incomplete (missing) parts of the tree are modeled by silent $-expressions ε, and their cost is estimated (i.e., not all information is lost). This means that the kΩ-optimization may, if some conditions are satisfied, be complete and optimal.
• In the examine phase the trees of possible solutions are pruned, minimizing the cost of solutions.
• In the execute phase up to n instructions are executed.
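The select/examine/execute loop can be sketched as a greedy bounded-lookahead search. This is a simplified illustration under our own assumptions (all names are ours), not the authors' full algorithm: select grows a lookahead tree to depth k with branching factor b, examine prunes it to the cheapest paths, and execute commits to the first n steps of the best path.

```python
import heapq

def komega_search(start, expand, cost, is_goal, k=3, b=5, n=1):
    """Greedy sketch of the kOmega-optimization loop (illustrative only)."""
    state = start
    while not is_goal(state):
        # select: grow a lookahead tree k steps deep, width limited to b
        frontier = [(cost(state), [state])]
        for _ in range(k):
            next_frontier = []
            for c, path in frontier:
                for child in expand(path[-1])[:b]:
                    next_frontier.append((c + cost(child), path + [child]))
            if next_frontier:
                # examine: prune, keeping only the b cheapest paths
                frontier = heapq.nsmallest(b, next_frontier)
        _, best_path = min(frontier)
        # execute: commit to up to n steps of the cheapest lookahead path
        steps = best_path[1:n + 1]
        if not steps:
            break
        state = steps[-1]
    return state

# Toy usage: reach 10 from 0 by steps of +1 or +3, guided by distance to 10.
found = komega_search(0, lambda x: [x + 1, x + 3],
                      lambda x: abs(10 - x), lambda x: x == 10)
```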

Moreover, because the $ (cost) operator may capture not only the cost of solutions, but also the cost of the resources used to find a solution, we obtain a powerful tool to avoid methods that are too costly, i.e., the $-calculus directly minimizes search cost. This basic feature, inherited from anytime algorithms, is needed to tackle hard optimization problems directly, and allows us to solve total optimization problems (the best quality solutions with minimal search costs). The variable k refers to the limited horizon for optimization, necessary due to the unpredictable dynamic nature or complexity of the environment. The variable Ω refers to a reduced alphabet of information: no agent ever has reliable information about all factors that influence all agents' behavior. To compensate for this, we mask factors where information is not available from consideration, reducing the alphabet of variables used by the $-function. By using the kΩ-optimization to find the strategy with the lowest $-function, the meta-system finds a satisficing solution, and sometimes the optimal one. This avoids wasting time trying to optimize behavior beyond the foreseeable future. It also limits consideration to those issues where relevant information is available. Thus the kΩ-optimization provides a flexible approach to local and/or global optimization in time or space. Technically this is done by replacing parts of $-expressions with invisible $-expressions ε, which remove part of the world from consideration (however, they are not ignored entirely: the cost of invisible actions is estimated). If some conditions are satisfied [7], the kΩ-optimization is guaranteed to find optimal, search optimal or totally optimal solutions.

3 Gene Finding

Proteins are the end product of genes, and thus the expression of the genes. Gene sequences are composed of exons, which are sequences that determine the encoding of proteins, and introns, which are sequences thought to be junk. DNA is formed of unpredictable intervals of introns and exons, where exons are usually delimited by the dinucleotides AG on the left and GT on the right. However, there are stop codons (TAA, TAG and TGA) that break the exon sequence. The challenge is to find those genes and to be able to predict their sequences. Different approaches have been used to solve this problem: the statistical approach, the similarity-based approach, and the hybrid approach. The statistical approach looks for the frequency of sequences that are highly conserved, like the exons delimited by AG and GT. This approach uses only information about the input sequence itself to identify likely splice sites and to detect differences in sequence composition between coding and non-coding DNA. The similarity-based approach is based on finding the similarity between the human genome and another related organism's genome, in most cases the mouse genome, of which a draft now exists. The hybrid approach is a mix of the similarity-based approach and sequence alignment. Traditionally the problems of gene finding and alignment have been treated separately, but it is becoming more and more apparent that they are closely

connected. Gene finding can be helped by aligning the sequences, and the alignment can be helped by first locating the genes. One goal in this paper is to show the reader how we modeled problems in different bioinformatics areas with the $-calculus. It is often thought that human gene recognition can be achieved with reasonable accuracy if a related protein in a mouse genomic sequence is found. This requires examining splice sites and using spliced alignment algorithms [9]. A simple problem one faces in gene finding is the following: given a set of putative (probable) exons, find a maximum-weight set of non-overlapping putative exons. A putative exon is modeled as a weighted interval (l,r,w), where l is the left position, r is the right position and w is a score that reflects the possibility that this sequence is an exon. The candidate exons are usually located between donor and acceptor sites.
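A naive scan for candidate exon intervals along the lines described above can be sketched as follows. This is a deliberate simplification (real gene finders combine splice-site detection with reading frames and statistical scoring); the function name and the simple AG/GT delimiting rule as coded here are illustrative.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_exons(dna):
    """Naive candidate-exon scan: a candidate starts right after an
    acceptor 'AG' and ends right before a donor 'GT', and is rejected
    if it contains an in-frame stop codon (TAA, TAG or TGA)."""
    candidates = []
    for m in re.finditer("AG", dna):
        start = m.end()                      # position just after the AG
        for g in re.finditer("GT", dna):
            end = g.start()                  # position of the GT
            if end <= start:
                continue
            exon = dna[start:end]
            codons = {exon[i:i + 3] for i in range(0, len(exon) - 2, 3)}
            if not (codons & STOP_CODONS):   # no in-frame stop codon
                candidates.append((start, end))
    return candidates

# Toy usage on a short synthetic string: one AG...GT window survives.
found = candidate_exons("CCAGATGGTCC")
```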


Figure 1: Potential candidates for exons forming a gene

There are 6 intervals: (1,3,1), (2,5,8), (4,7,3), (6,10,7), (8,11,5), (9,12,4). The goal is to find the set of non-overlapping exons maximizing the sum of weights. Modeling the exon chaining problem in the $-calculus depends on building the tree with maximum depth k and branching factor b equal to the number of intervals. Intervals that overlap are combined by the cost choice operator; intervals that do not overlap are combined by the sequential composition operator. We will use only 3 operators from the $-calculus: cost choice ⊔ and sequential composition ° in the select phase, and the cost operator $ in the examine phase to prune the search tree. We assume that we use a standard cost function [7], i.e., $(⊔ a b) = min($(a),$(b)) and $(° a b) = $(a)+$(b). Because cost choice performs minimization, we flip the weights to negative values (the alternative would be to use the adversary choice operator and keep positive weights). This leads to the following $-expression constructed in the select phase of the kΩ-optimization:

(⊔ (° (1,3,-1) (⊔ (6,10,-7) (° (4,7,-3) (⊔ (8,11,-5) (9,12,-4))))) (° (2,5,-8) (⊔ (6,10,-7) (8,11,-5) (9,12,-4))))

In the examine phase, the tree is pruned (the cheapest paths selected) by using the cost operator:

$(⊔ (° (1,3,-1) (⊔ (6,10,-7) (° (4,7,-3) (⊔ (8,11,-5) (9,12,-4))))) (° (2,5,-8) (⊔ (6,10,-7) (8,11,-5) (9,12,-4))))
= min($(1,3,-1) + $(4,7,-3) + min($(8,11,-5), $(9,12,-4)), $(1,3,-1) + $(6,10,-7), $(2,5,-8) + min($(6,10,-7), $(8,11,-5), $(9,12,-4)))
= min(-1-3-5, -1-7, -8-7) = -15

The above means that in the execute phase of the kΩ-optimization the two sequences (2,5,-8) and (6,10,-7) will be left as potential candidates for the non-overlapping exons optimizing the score, which is consistent with the results of the conventional exon chaining algorithm [9]. In other words, the optimal chain of non-overlapping exons is described by the $-expression (° (2,5,-8) (6,10,-7)).
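For comparison, the conventional exon chaining computation referred to above can be sketched as a weighted interval scheduling dynamic program. This is our sketch, not the authors' code; it works with the positive weights directly (the $-calculus encoding negates them because cost choice minimizes).

```python
def exon_chaining(intervals):
    """Select non-overlapping intervals (l, r, w) maximizing total weight,
    via the standard weighted-interval-scheduling dynamic program."""
    ivs = sorted(intervals, key=lambda t: t[1])      # sort by right endpoint

    def prev_compatible(i):
        # 1-based index of the last interval ending strictly left of ivs[i-1]
        l = ivs[i - 1][0]
        for j in range(i - 1, 0, -1):
            if ivs[j - 1][1] < l:
                return j
        return 0

    # best[i] = max total weight using only the first i intervals
    best = [0] * (len(ivs) + 1)
    for i in range(1, len(ivs) + 1):
        best[i] = max(best[i - 1], best[prev_compatible(i)] + ivs[i - 1][2])

    # backtrack to recover the chain itself
    chain, i = [], len(ivs)
    while i > 0:
        p = prev_compatible(i)
        if best[p] + ivs[i - 1][2] > best[i - 1]:
            chain.append(ivs[i - 1])
            i = p
        else:
            i -= 1
    return best[len(ivs)], chain[::-1]

# The six intervals of Figure 1: the optimal chain is (2,5,8), (6,10,7),
# with total weight 15, matching the $-calculus result of -15 above.
score, chain = exon_chaining([(1, 3, 1), (2, 5, 8), (4, 7, 3),
                              (6, 10, 7), (8, 11, 5), (9, 12, 4)])
```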

4 Sequence Alignment

Sequence alignment compares two or more sequences by looking for character patterns that appear in both (or all) of the sequences, in the same order. During the course of evolution, mutations occurred, creating differences between families of contemporary species. These mutations are:
– Insertion - inserting one or several letters into the sequence.
– Deletion - deleting one or more letters from the sequence.
– Substitution - replacing a sequence letter by another.
When we compare two sequences, we are looking for evidence that they have diverged from a common ancestor by a mutation process. There are two kinds of alignments: global and local.

4.1 Global Sequence Alignment

The global alignment [12] compares entire sequences from end to end. Example of a global alignment (vertical bars mark identical aligned characters):

AGCNS–RQCLC–R–MNTPQ
||  | | |||
AG–PS–RFCLC–PTO–LNP

4.1.1 Formal Description with Gap Penalties

The following is a formal description of aligning 2 sequences V and W: let F(i,j) be the optimal alignment score of V = x1…xi and W = y1…yj (1
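A minimal sketch of this global alignment dynamic program is given below, assuming a linear gap penalty and simple unit match/mismatch scores (the parameter values and function name are illustrative, not the paper's).

```python
def global_align_score(v, w, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the standard dynamic program:
    F(i,j) = max(F(i-1,j-1) + s(xi,yj), F(i-1,j) + gap, F(i,j-1) + gap),
    where s is the match/mismatch score (linear gap penalty assumed)."""
    n, m = len(v), len(w)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap            # prefix of v aligned against gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap            # prefix of w aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if v[i - 1] == w[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match/mismatch
                          F[i - 1][j] + gap,     # gap in w
                          F[i][j - 1] + gap)     # gap in v
    return F[n][m]

# Toy usage: aligning AGC against AC gives two matches and one gap.
score = global_align_score("AGC", "AC")
```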