Biosequence structure searching on the Web

StructWeb: Biosequence structure searching on the Web using clp(FD)

David Gilbert

Ingvar Eidhammer

Inge Jonassen

August 1997

Abstract

We describe an implementation in a finite domain constraint logic programming language of a web-based biosequence structure search program. We have used the clp(FD) language for the implementation of our search engine and have ported the PiLLoW libraries to clp(FD). Our program is based on CBSDL, a constraint based structure description language for biosequences, and uses constrained descriptions to search for structures in biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots. We also discuss issues encountered in porting the PiLLoW libraries to clp(FD), and the design and construction of the Web interface for our search engine.

Keywords: constraints, biostructures, description language, searching, WWW interface.

1 Introduction

In this paper we report an implementation of a web-based biosequence structure searching program. This implementation has been constructed using the finite domain constraint logic programming language clp(FD) [13] together with our own port of the PiLLoW library [10] to clp(FD). Our search engine is based on CBSDL, a constraint based structure description language for biosequences described in [15], and uses constrained descriptions to search for structures in biosequences, such as tandem repeats, stem loops, palindromes and pseudo-knots. Our motivation is to provide a tool for use by molecular biologists engaged in biosequence structure analysis. We have made the tool accessible over the Web in order to facilitate testing, since this enables us to make the latest version available and also avoids the need to create and install stand-alone versions for a variety of platforms.

Constrained patterns

The goal underpinning the development of our search engine is to investigate how constraint solving techniques can be used to search for structural patterns in sequences (or strings) of symbols over a finite alphabet Σ. The main motivation is searching in biological sequences, and also providing high-level descriptions of biosequence database contents, but we believe that programs for searching for such patterns might also be useful in other areas, e.g. signal processing or the treatment of acoustic data. We define a pattern as consisting of a logical expression on components and a set of unary and binary constraints on the components, where a component is a description of a string of symbols. An input string S matches a pattern if for each component it contains a substring matching that component, such that all the constraints are satisfied.

Author affiliations: David Gilbert, Department of Computer Science, City University, London, UK, [email protected]; Ingvar Eidhammer, Department of Informatics, University of Bergen, Norway, [email protected]; Inge Jonassen, Department of Informatics, University of Bergen, Norway, [email protected]


    (i)             (ii)            (iii)           (iv)

     c               g                x               x
    a-u             u-a              o-o             o-o
    g-c             a-u              o-o             o-o
    u-a             g-c              o-o             o-o
    c-g             c-g              o-o             c-o
augc   ggcau    aggc   ccgu        xx   xx         xx   xx

Figure 1: Illustration of structures and structural patterns: (i) and (ii) show two examples of structures (stem loops) that might be equivalent in RNA molecules. Figure (iii) shows a possible representation of a pure structural pattern matching the structures (i) and (ii). The pattern can also be called a consensus for (i) and (ii). The o and x symbols each match any one nucleotide symbol, and pairs of o symbols connected with a dash (-) should match pairs of symbols that can base pair. Figure (iv) shows a structural pattern equivalent to the pattern shown in (iii) except that the first nucleotide in the first part of the stem has to be a c.

A pattern can contain constraints of the following five types, on the:

1. length of a substring to match a specific component,
2. distance (in the input string) between substrings to match the different components of a pattern,
3. contents of a substring to match a component, e.g. the second symbol should be an a or a t,
4. positions on the input string where a particular component can match,
5. correlation between two substrings matching different components, e.g. the substrings should be identical, or the reverse of each other.

We also define two associated classes of patterns:

Sequential: patterns which do not include a correlation constraint. The patterns in the PROSITE database [4] are examples of this class, for example [AC]-x(2,3)-D, which describes a pattern comprising three components, the first being an A or a C, the second of length 2 or 3 and the last consisting of a D. The PROSITE language [4] can describe such patterns.

Structural: patterns having at least one correlation constraint and zero or more content constraints. One example is repetition, where the substrings matching two different components must be identical. Another example is a palindrome, where two consecutive substrings of equal length must be the reverse of each other.
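As an illustration of the sequential class, a PROSITE-style pattern such as [AC]-x(2,3)-D can be translated into a regular expression. The following Python sketch is not part of the system described in this paper: the function names prosite_to_regex and find_matches are hypothetical, the test sequence is made up, and only a small subset of PROSITE syntax (single residues, bracketed sets, x, and (n) or (n,m) repeats) is assumed.

```python
import re

# One pattern element: a residue, a [..] set or x, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE syntax into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, low, high = m.groups()
        atom = "." if symbol == "x" else symbol   # x matches any one symbol
        if low is None:
            parts.append(atom)
        elif high is None:
            parts.append(f"{atom}{{{low}}}")
        else:
            parts.append(f"{atom}{{{low},{high}}}")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) pairs, at most one per start position."""
    # A lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("[AC]-x(2,3)-D"))
    print(find_matches("[AC]-x(2,3)-D", "GAKLDTCWWWD"))
```

Because no correlation between components is involved, an ordinary regular expression engine suffices here; structural patterns need the correlation constraints introduced later.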

2 Biological motivation

The primary structure of biological macromolecules (DNA, RNA and proteins) can be coded as sequences (strings) over an alphabet of four bases, {a, c, g, t} for DNA and {a, c, g, u} for RNA, or of 20 amino acids for proteins. Conventionally, alphabets for DNA/RNA are written in lower case and those for proteins in upper case. In nature, DNA normally forms a double helix where a base in one strand is bonded to a complementary base (forming bonds a-t and g-c) in the other strand. The bases in RNA molecules can form bonds a-u and g-c in a similar way. RNA and protein molecules fold into 3 dimensional structures enabling them to perform their structural/functional role in the cell. The structures can be described at different levels. The primary structure is simply the sequence. For RNA molecules, the secondary structure is the collection of base pairs which are formed in the folded molecule, and the tertiary structure is the complete 3 dimensional structure of the folded molecule.

For describing patterns in RNA sequences, one needs to include dependencies between individual letters because the base-pairing interactions play a dominant role in determining RNA structure and function [23]. We call such patterns of dependencies structures in the sequences, and note that such patterns can be described using structural patterns as defined in the Introduction. See Figure 1 for an illustration. We can also describe more complex structures found in RNA molecules such as clover-leafs and pseudo-knots. Interesting structures are also found in DNA sequences (e.g., repeats and palindromes) that can be described using structural but not sequential patterns.

Automatic methods have been developed to discover characteristic sequential patterns for protein families (for a survey, see [8]). Using structural patterns, we believe that analogous methods can be developed for RNA sequence families. As well as being useful for classification purposes, the discovered patterns also help in understanding the relationships between sequence, structure, and function of the proteins or RNA molecules under study. The development of efficient and general methods for matching and discovering such patterns will help progress in molecular biology.

Example structures

In the structure descriptions below, α and γ (with or without indices) are pattern components, x matches any one symbol (as in Figure 1), α^r is the reverse of α, α^c is the complement of α, and α^rc is the reverse complement of α. Below we give examples of types of structures found in the literature [24, 6], and for each type we give one example. All examples are from DNA/RNA sequences.

Structure           Description                        Example
Tandem repeat       α α                                acgacg
Single repeat       α γ α                              acgaaacg
Multiple repeat     α γ1 α γ2 α                        acgaaacguuacg
Stem loop           α γ α^rc                           acgaacgu
Attenuator          α γ1 α^rc γ2 α                     acgaacguauacg
Palindrome, even    α α^r                              acggca
Palindrome, odd     α x α^r                            acgagca
Pseudoknot          α1 γ1 α2 γ2 α1^rc γ3 α2^rc         acgaaucugccguauaaga

More complicated structures can be obtained by combining the ones above, e.g. clover-leafs.
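The reverse, complement and reverse complement operations used in these descriptions can be made concrete with a few lines of Python. This is only a sketch for checking the examples by hand: the helper names are hypothetical, and strict a-u and g-c pairing over the RNA alphabet is assumed (no g-u pairs).

```python
# Strict RNA base pairing; g-u wobble pairs are deliberately left out of this sketch.
RNA_COMPLEMENT = {"a": "u", "c": "g", "g": "c", "u": "a"}


def reverse(s):
    return s[::-1]


def complement(s):
    return "".join(RNA_COMPLEMENT[ch] for ch in s)


def reverse_complement(s):
    return complement(reverse(s))


def is_tandem_repeat(s):
    """True if s consists of two identical, adjacent halves."""
    half, rest = divmod(len(s), 2)
    return rest == 0 and half > 0 and s[:half] == s[half:]


def is_even_palindrome(s):
    """True if the second half of s is the reverse of the first half."""
    half, rest = divmod(len(s), 2)
    return rest == 0 and half > 0 and s[half:] == reverse(s[:half])


def is_stem_loop(s, stem_len):
    """True if the last stem_len symbols are the reverse complement of the first stem_len."""
    if stem_len <= 0 or len(s) <= 2 * stem_len:
        return False          # a stem loop needs a non-empty loop in the middle
    return s[-stem_len:] == reverse_complement(s[:stem_len])


if __name__ == "__main__":
    print(reverse_complement("acg"))        # the second stem half of acgaacgu
    print(is_tandem_repeat("acgacg"))
    print(is_even_palindrome("acggca"))
    print(is_stem_loop("acgaacgu", 3))
```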

3 The structure language

A pattern in our language is defined by a structure specification, which is of the form

S, c1, ..., cn

where S is a string expression and c1, ..., cn is a set of constraints. The string expression specifies the components, or string variables (denoted by the Greek letters α, β, γ, ... and possibly subscripted), taking part in the pattern, and a logical expression on them using conjunction, disjunction and negation. A set of constraints can contain constraints of the five types: length, distance, content, position and correlation constraints, which are described below. In addition, we permit equality and inequality operations over the integer components of the constraints, with the usual arithmetic operations over integers: addition, subtraction, multiplication and integer division. We further allow the user to describe complex structures by conjoining structure specifications.

A length constraint restricts the length of a string variable to be within a particular range, and has the form length(α, L), where α is a string variable and L ranges over the positive integers, such that the length of α is constrained to be within the range of L. We permit the length of a string variable to be 0 in order to be able to describe null-strings. Furthermore, we introduce two variants, maxlength(α, L) and minlength(α, L), such that the length of α is the maximum, respectively minimum, value possible within the range denoted by L according to some mapping to a given input string. In this way redundant matches are avoided, e.g. in the case of stem loops, where matches on substrings of the stem are not required.

A distance constraint restricts the distance between two string variables. Distance constraints are specified in a declarative and uniform way, e.g. start_start(α, β, D), end_start(α, β, D), start_end(α, β, D), end_end(α, β, D), where α and β are string variables and D ranges over the integers. These relations constrain the distance between the start of α and the start of β (respectively the end of α and the start of β, the start of α and the end of β, the end of α and the end of β) to lie within the range denoted by D. A negative value for D indicates that the point of reference of α occurs after the corresponding point of reference of β in the input string. We also permit the shorthand α : β to indicate that β starts directly after α.

This shorthand is equivalent to α ∧ β, end_start(α, β, 1).

A content constraint restricts which symbols can be in a specific position on a string variable matching a component and is expressed thus: content(α, Pos, Set), where α is a string variable, Pos is a positive or negative (non-zero) integer representing $\alpha_{Pos}$, the character of α at position Pos from the start (or from the end if Pos is negative) of α, and Set is a (non-empty) set of characters to which $\alpha_{Pos}$ may be bound, e.g. {a, t}.

A position constraint restricts the absolute positions of a string variable on the input string and is expressed as start(α, P) or end(α, P), where α is a string variable and P ranges over the positive integers such that the first (respectively last) character of α is located at position P on the input string.

A correlation constraint ("correlation" for short) defines the relation between the contents of two string variables. A correlation C has the following properties:

- It relates two string variables, C(α, β).
- The lengths of the two string variables must be equal.
- There is a direction-component $C_d$, written as the relation $C_d(\alpha, \beta)$. The two legal values for $C_d$ are 1 and -1. $1(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_i)$. $-1(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \alpha_i \text{ is related to } \beta_{h-i+1})$, where $\alpha_i$ and $\beta_i$ are symbols from α and β, and h is the length of the matching substrings. Note that this means that all positions of the string variables take part in the correlation.
- There is a symbol-component $C_s$. As part of this component a function $C_f$ is defined from $\Sigma$ to $2^{\Sigma}$. $C_s(\alpha, \beta)$ is satisfied iff $(\forall i : 1 \le i \le h : \beta_i \in C_f(\alpha_i))$.

Let L be the language of all strings with symbols from $\Sigma$. The correlation $C(\alpha, \beta)$ is satisfied iff $\exists x : x \in L : C_d(\alpha, x) \wedge C_s(x, \beta)$.

Furthermore, we define a notion of approximate matching, given as an argument to the appropriate correlation constraints.
This argument ranges over the interval 0..100 and represents the percentage mismatch between two string variables; when the mismatch is zero we can omit this argument. We can use Hamming distance [17], edit distance or more generally Levenshtein distance [20] in order to implement approximate matching (see footnote 1).

We define id(α, β) and reverse(α, β) as general correlation constraints over all alphabets, where β is the identity (respectively, reverse) of α. We assume that there is a library of correlations, and that a user may add a new correlation to the library, use a known correlation, or use (without storing it in the library) an unnamed correlation in a specification. A correlation is thus defined by two arguments, the direction and the symbol component. For example the definition of the reverse complement for the DNA alphabet is rev_compl_DNA(-1, {A → {T}, C → {G}, G → {C}, T → {A}}). Pre-defined correlations might be in a library.
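The definition of a correlation by a direction and a symbol component, together with the percentage mismatch argument, can be sketched in Python as follows. This is an illustration only and not the clp(FD) implementation: the function name correlation_holds is hypothetical, "is related to" in the direction-component is assumed to mean equality (so that x is α itself or α reversed), lower-case DNA symbols are used, and mismatch is counted as Hamming distance.

```python
def correlation_holds(alpha, beta, direction, symbol_map, max_mismatch_percent=0):
    """Check C(alpha, beta) for a correlation given by (direction, symbol_map).

    direction is 1 or -1; symbol_map plays the role of C_f, mapping each
    symbol to the set of symbols it may be correlated with.
    """
    if direction not in (1, -1) or len(alpha) != len(beta):
        return False
    h = len(alpha)
    # Direction-component: x is alpha read forwards (1) or backwards (-1).
    x = alpha if direction == 1 else alpha[::-1]
    # Symbol-component, relaxed by counting the positions that violate it.
    mismatches = sum(1 for i in range(h) if beta[i] not in symbol_map[x[i]])
    return mismatches * 100 <= max_mismatch_percent * h


# Symbol components for two library-style correlations over lower-case DNA.
IDENTITY_DNA = {ch: {ch} for ch in "acgt"}
REV_COMPL_DNA = {"a": {"t"}, "c": {"g"}, "g": {"c"}, "t": {"a"}}

if __name__ == "__main__":
    print(correlation_holds("acg", "acg", 1, IDENTITY_DNA))         # id
    print(correlation_holds("acg", "gca", -1, IDENTITY_DNA))        # reverse
    print(correlation_holds("cctg", "cagg", -1, REV_COMPL_DNA))     # reverse complement
    print(correlation_holds("acgt", "acga", 1, IDENTITY_DNA, 25))   # one mismatch in four
```

Passing the direction and the symbol map as data mirrors the idea of a library of correlations: a new correlation is just a new pair of arguments.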

Example structure descriptions

A description, or structure specification, of the stem loop using exact matching in Figure 1(iv) is

α : γ : β, maxlength(α, 4), length(γ, 1), content(α, 1, {c}), rev_compl_RNA(α, β)

assuming a library definition of rev_compl_RNA along the lines of the one above, and where α and β form the stem, with γ the loop.

User queries can be formulated in this "raw form", where an input string is appended to a structure description and some mapping algorithm is used to map the description to the string. Thus the user may enter the following query:

α : γ : β, maxlength(α, 4), length(γ, 1), content(α, 1, {c}), rev_compl_DNA(α, β), tatacctgtcaggtata

which will result in α being mapped to the substring cctg starting at position 5 and ending at 8, β to cagg starting at 10 and ending at 13, and γ to t at position 9. Queries may optionally be prefaced by a description of the alphabet of characters which are permitted in the input string.

In order to improve the usability of the language we have defined a macro facility permitting the user to store and re-use definitions of, i.e. grammars for, specific structures. The syntax of this macro language is similar to that of logic programs; the following grammar defines a language for pseudo-knots:

pseudoknot(α, β, γ, δ) :- α : ω1 : β : ω2 : γ : ω3 : δ, rev_compl_RNA(α, γ), rev_compl_RNA(β, δ).

Such descriptions can be parameterised by the lengths of the components of the structure and the mismatch, e.g.:

stemloop(α, β, γ, StemLength, LoopLength, Mismatch) :- α : γ : β, rev_compl_RNA(α, β, Mismatch), maxlength(α, StemLength), length(γ, LoopLength), maxlength(β, StemLength).

Note that in descriptions of stem loops we are interested in finding those structures with the maximal possible length of the stem, in order to avoid matching on to many substructures of that stem loop. We are not, however, interested in defining maxlength over the loop, since if we did so we would possibly omit several different structures. For example, given the sequence

ugcucaaaagagcuaaagagcu
1234567890123456789012

and an attempt to match with a definition

stemloop(α, β, γ, StemL, LoopL, 0), 3 ≤ StemL, StemL ≤ 7, 1 ≤ LoopL, LoopL ≤ 20

two different stem loops should be identified, i.e.

(1) gcucaaaagagc, starting at 2 and ending at 13, where α = gcuc, γ = aaaa, β = gagc
(2) gcucaaaagagcuaaagagc, starting at 2 and ending at 21, where α = gcuc, γ = aaaagagcuaaa, β = gagc

Footnote 1: Minimum transformation costs are calculated for: Hamming distance, substitution only; edit distance, insertion and deletion only; Levenshtein distance, substitution, deletion and insertion.
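The stem loop examples of this section can be checked with a brute-force enumeration. The Python sketch below is not the constraint-based engine described in this paper: the function find_stem_loops and its parameters are hypothetical, only strict Watson-Crick pairs are assumed, and every stem length in the given range is enumerated, so the maxlength filtering that removes sub-stems of a longer stem is not implemented.

```python
DNA_PAIRS = {("a", "t"), ("t", "a"), ("c", "g"), ("g", "c")}
RNA_PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}   # no g-u wobble here


def find_stem_loops(seq, pairs, stem_range, loop_range, first_stem_symbols=None):
    """Return (start, alpha, gamma, beta) for every stem loop alpha : gamma : beta.

    start is 1-based. stem_range and loop_range are inclusive (min, max) pairs.
    first_stem_symbols, if given, restricts the first symbol of alpha
    (the analogue of content(alpha, 1, Set)).
    """
    results = []
    n = len(seq)
    for start in range(n):
        for stem in range(stem_range[0], stem_range[1] + 1):
            for loop in range(loop_range[0], loop_range[1] + 1):
                end = start + 2 * stem + loop
                if end > n:
                    break                      # longer loops cannot fit either
                alpha = seq[start:start + stem]
                gamma = seq[start + stem:start + stem + loop]
                beta = seq[start + stem + loop:end]
                if first_stem_symbols and alpha[0] not in first_stem_symbols:
                    continue
                # Reverse complement: alpha[i] must pair with beta[stem - 1 - i].
                if all((alpha[i], beta[stem - 1 - i]) in pairs for i in range(stem)):
                    results.append((start + 1, alpha, gamma, beta))
    return results


if __name__ == "__main__":
    # The raw query, with the stem length fixed at 4 and the loop length at 1.
    print(find_stem_loops("tatacctgtcaggtata", DNA_PAIRS, (4, 4), (1, 1), {"c"}))
    # The parameterised stemloop example: 3 <= StemL <= 7, 1 <= LoopL <= 20.
    for hit in find_stem_loops("ugcucaaaagagcuaaagagcu", RNA_PAIRS, (3, 7), (1, 20)):
        print(hit)
```

With the stem length fixed at 4, the first call returns the single mapping given in the text (α = cctg at position 5, γ = t, β = cagg). The second call lists both stem loops (1) and (2), together with any shorter sub-stems that a maxlength constraint would suppress.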

4 Implementation of the search engine using constraint logic programming

Language representation in clp(FD)

We represent components, which we term here string variables, by sequences with maximum length m of string-characters. These comprise pairs whose first element Chars is a set of characters drawn from some alphabet A (of bases or nucleotides) and whose second element Pos is a set of integers in 1...m. Each pair represents the possible values of the characters to be found on the input string at the locations indicated by the second element of the pair. Moreover we assume that the successor relation holds between the second elements of neighbouring members of the sequence in the normally accepted direction of ordering. A (suitably constrained) string variable is thus a schema for a structure, and can be instantiated by matching against an input string (see below).

We have chosen constraint logic programming over finite domains [19] as a paradigm for implementation because of the declarative nature of our structure language and the use which it makes of finite domain constraints. In our implementation sequences are represented as lists, and thus string variables comprise lists whose elements are pairs of (Chars,Pos). We also choose to map alphabets onto (dense subsets of) the natural numbers, so that for example for DNA we represent a, c, g, t by 1, 2, 3 and 4 respectively. In this way we can use any finite domain constraint logic programming language, even one which does not permit operations over arbitrary finite domains. We have used clp(FD) [13] as the basis for our implementation because it has a specialised operation for complementation over genomic alphabets (see below). Moreover, the clp(FD) system is freely available, small in size and can compile to executable code. Ideally we would also like to be able to use a string solver, along the lines of [29], [16] or [22].
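The (Chars,Pos) representation can be pictured with ordinary Python sets standing in for finite domain variables. This is a sketch of the data structure only, with no constraint propagation; the names DNA_CODES, input_to_string_variable and fresh_string_variable are hypothetical.

```python
# Numeric coding of the DNA alphabet, as in the text: a, c, g, t -> 1, 2, 3, 4.
DNA_CODES = {"a": 1, "c": 2, "g": 3, "t": 4}


def input_to_string_variable(seq, codes=DNA_CODES):
    """A fully instantiated string variable: every domain is a singleton set."""
    return [({codes[ch]}, {pos}) for pos, ch in enumerate(seq, start=1)]


def fresh_string_variable(length, max_len, codes=DNA_CODES):
    """An unconstrained string variable: any character at any position 1..max_len."""
    all_chars = set(codes.values())
    all_positions = set(range(1, max_len + 1))
    # Each pair gets its own copies so that domains can be narrowed independently.
    return [(set(all_chars), set(all_positions)) for _ in range(length)]


if __name__ == "__main__":
    print(input_to_string_variable("agt"))
    print(fresh_string_variable(2, 5))
```

Matching then amounts to narrowing the domains of a fresh string variable until each is a singleton that agrees with a contiguous run of pairs from the input.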

Length constraints are defined in the usual backtracking manner for lists, although ideally we would like to use a list solver (for example [22]).

Distance constraints are defined simply by referring to the position elements of character pairs.

Content constraints are implemented by imposing constraints on the integer sets representing the characters, using the sparse representation of finite domain variables in clp(FD) to describe non-continuous domains.

Position constraints are straightforwardly implemented by constraining the position element of a string-character pair.

General correlation constraints (those independent of any alphabet) are coded in clp(FD) as follows. The id constraint constrains the corresponding characters in the string-character pairs to be equal. Note that the position elements in each corresponding pair are not constrained by this relation, since the string variables may be mapped to different places on the input string. The reverse constraint first of all reverses one of the string variables and then constrains it to be identical to the other string variable. Approximate matching between string variables is implemented using Hamming distance and relating this to the length of the list representing the string variable.

Complementation constraints are implemented using a specialised solving routine compl/4 in clp(FD). For example RNA, whose alphabet a, c, g and u we represent by 1, 2, 3 and 4 respectively, has complements {a → {u}, c → {g}, g → {c,u}, u → {a,g}}. We represent this by

complement_char(Char1,Char2):-
    compl(Char1,1,Char2,[4]),
    compl(Char1,2,Char2,[3]),
    compl(Char1,3,Char2,[2,4]),
    compl(Char1,4,Char2,[1,3]).

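where the definition of compl/4 is

compl(A, Char, B, Chars):-
    % The operator linking each test to its truth variable was lost from this
    % copy of the text; an equivalence (<=>) is assumed here.
    % Val1 =< Val2 then encodes: A=Char implies B in Chars.
    A=Char <=> Val1,
    B in Chars <=> Val2,
    Val1 in 0 .. max(Val2),
    Val2 in min(Val1) .. 1.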
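The effect of the four compl/4 goals is an implication per symbol: if Char1 is a given symbol, Char2 must lie in that symbol's complement set. A table-driven Python sketch of the same relation is given below. It illustrates the relation only, not the clp(FD) propagation; the name complement_domain is hypothetical, and the Python complement_char is a plain test rather than a constraint.

```python
# RNA coded as a, c, g, u -> 1, 2, 3, 4; complement sets include the g-u pairs.
RNA_COMPLEMENTS = {1: {4}, 2: {3}, 3: {2, 4}, 4: {1, 3}}


def complement_char(char1, char2):
    """True if char2 is an allowed complement of char1."""
    return char2 in RNA_COMPLEMENTS[char1]


def complement_domain(char1_domain):
    """Values still possible for Char2 when Char1 ranges over char1_domain."""
    allowed = set()
    for char1 in char1_domain:
        allowed |= RNA_COMPLEMENTS[char1]
    return allowed


if __name__ == "__main__":
    print(complement_char(3, 4))        # g pairs with u
    print(complement_char(1, 3))        # a does not pair with g
    print(complement_domain({1, 2}))    # Char1 in {a, c} leaves Char2 in {g, u}
```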

The search engine

The function of a processor for our language is to match a structure description on to an input string, in order to determine the contents and locations of those substrings of the input string which match the components of the description. Thus a solution to a mapping of a string expression onto an input string is a valuation (an assignment to each constraint variable in the string expression of one value from the domain of the variable) such that all the constraints are satisfied. Each element of all string-character pairs must be a singleton set satisfying the constraints on that element; an empty set indicates a failure to produce a solution. In our problem domain we are interested in producing all the solutions (mappings) possible of a given string expression onto an input string.

An input string I comprises a sequence of characters drawn from some alphabet A (of bases or nucleotides); we limit the maximum length of any string to be less than or equal to some maximum integer m. In order to perform mapping we first convert the input string into a string variable, i.e. a list whose elements are pairs of (Chars,Pos). For example the DNA sequence of bases agt is converted to the list [({1},{1}),({3},{2}),({4},{3})] using our numeric representation of the base alphabet.

We have defined a naive procedure to map a specification Spec (i.e. a constrained string expression SE) onto an input string I using backtracking. We assume two types of correlation, c (normal correlation) and r (reverse correlation), and a function p1 : x × y → x.

for each pair of string variables (α, β) in SE correlated by correlation c do
    find members of I s.t. α_1 = I_j and β_1 = I_k and set i = 1
    while c(p1(α_i), p1(β_i)) and i ≤ length(α) do
        i := i + 1 and j := j + 1 and k := k + 1
        α_i = I_j and β_i = I_k
    end
end

for each pair of string variables (α, β) in SE correlated by correlation r do
    set l = length(β)
    find members of I s.t. α_1 = I_j and β_l = I_k and set i1 = 1, i2 = l
    while c(p1(α_i1), p1(β_i2)) and i1 ≤ length(α) do
        i1 := i1 + 1 and i2 := i2 - 1 and j := j + 1 and k := k - 1
        α_i1 = I_j and β_i2 = I_k
    end
end
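The naive procedure can be approximated in Python by a generator that tries every pair of start positions and yields each successful mapping in turn. This is a sketch under stated assumptions rather than the authors' code: the function map_correlated_pair is hypothetical, the two substrings are assumed not to overlap with the second lying after the first, and the length of the correlated regions is fixed by the caller.

```python
def map_correlated_pair(seq, length, related, reverse=False):
    """Yield (j, k): 1-based starts of two correlated substrings of the given length.

    related(x, y) is the symbol test applied position by position.
    With reverse=True the second substring is read from its last symbol
    back to its first, as in the reverse correlation r.
    """
    n = len(seq)
    for j in range(n - length + 1):
        # Assumption of this sketch: the second substring starts after the first ends.
        for k in range(j + length, n - length + 1):
            first = seq[j:j + length]
            second = seq[k:k + length]
            if reverse:
                second = second[::-1]
            if all(related(x, y) for x, y in zip(first, second)):
                yield (j + 1, k + 1)


def same(x, y):
    return x == y


if __name__ == "__main__":
    # Single repeat acg..acg: identical 3-mers.
    print(list(map_correlated_pair("acgaaacg", 3, same)))
    # Even palindrome acggca: the second half is the reverse of the first.
    print(list(map_correlated_pair("acggca", 3, same, reverse=True)))
```

A generator hands solutions to the caller one at a time, so the caller may stop early; this is loosely analogous to enumerating solutions on backtracking rather than collecting them all first.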

Since our program is compiled to native code without an emulator or top-level query evaluator in clp(FD), we generate all the possible matches between a string variable and an input string by a failure-driven loop. This avoids the need to write our own query evaluator with interactive backtracking on user input of `;'. Moreover, since our searches are computationally expensive we do not use setof/3 in order to collect solutions, and prefer to give the user the ability to abort the computation if he thinks that too many solutions are being produced, or too much time is being taken by the computation.

5 Interfacing the search engine to the WWW

Why a Web interface?

A Web interface proved an attractive proposition due to several factors: the ease with which a user-friendly interface could be provided, the freedom from multiple architecture considerations, and the fact that system updates can be made available instantaneously:

User interface: The interface for the stand-alone system is not attractive to users since they usually want to make repeated queries with small changes in parameters; some kind of form-like interface would be ideal for this, where user-entered data values are preserved between queries. The Web provides a very easy way to construct such an interface.

Architectures: The non-Web version has to be recompiled for the different architectures/operating systems on which it can be run (and there will always be one user who has a machine to which the clp(FD) compiler has not been ported). The Web version allows us to compile the program for one architecture only: our server.

Testing: At present the program is in a Beta-test stage; we need to get feedback quickly from potential users and would like to do this without having to physically install it on their systems.

Updates: As the program is being updated rapidly, we would like to make these updates immediately available to users; the Web is the ideal mechanism to achieve this.

Hence, in order to make our program more accessible and to make the latest version of the search engine available, we have constructed a user interface accessible via a Web browser capable of handling HTML forms. We have given a default query data set and input string in the form so that the naive user has an example query to experiment with. Interfacing to the search engine is achieved by using the PiLLoW libraries. The interface permits the user to enter descriptions of the structures that he is interested in and to initiate a mapping operation, and then returns the results of the mapping to the user. The queries are handled by a query evaluator, which checks query parameters, expands macros, and translates the queries into an internal form. This form is passed down to the constraint search engine, which sets up the data structures, imposes the constraints on them and uses a matching algorithm to solve the constraints. Results of matching are output as the locations of the structures found and, optionally, the strings themselves; we plan to enhance the system with some graphical representation of the structures found.

Forms and CGI interface

The general issues of making applications accessible using the Web are covered in various texts, see for example [9]; the PiLLoW system is described in [10], which contains a detailed description of methods for interfacing logic programming systems to the Internet/WWW. Briefly, an Internet client can invoke a program on a server via a browser, for example, by sending the URL of the program to the server (as long as the program has a recognised extension, .cgi, and the right permissions are set). Such a program is called a CGI (Common Gateway Interface) program. Output from the invocation is returned to the client and must be in the form of an HTML page if it is to be interpreted by a browser. The main challenge is that of permitting data from the client to be sent to the server and then accepted as input by the program on the server.

Sending data from the client to the server-side program can be accomplished using HTML forms, which permit the user to enter or select values for fields which may be text or numeric in type. There are two methods for actually sending this data: GET and POST. In the former the data is appended to the URL of the CGI program and is then put into an environment variable called, for example, QUERY_STRING. The advantage of the GET method is its relative simplicity and the fact that the query information is visible at the client side as an addendum to the URL of the CGI program. The disadvantage of this method is that the environment can run out of space when a large amount of data is sent. In the POST method, on the other hand, some information is put into environment variables, for example the number of bytes of the actual data, and then the data is sent to the CGI program as standard input.
This program must pick up from the environment the information about the length of the data contents, since there is no distinguished character sent to indicate the end of the data stream, and then use the length to read the rest of the data byte by byte. The POST method is best suited to situations where a relatively large amount of data is sent to the server, and our implementation makes use of this method since we provide a specialised query engine and users provide both a query (in the form of the required parameters) and the input string; see Figure 2.

The form can be found at http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html, and sample databases at http://www.ebi.ac.uk/srs/srsc. The form comes pre-filled with data, and when reset will be re-filled with this data. Numeric fields may be left blank or filled in; the less constrained the query is, the longer the processing will take. A disadvantage with the CGI interface is that no client-side processing is performed, and thus failures can potentially occur if, for example, the fields are filled in incorrectly. Our program checks the types and values of input data and returns a `failed query' message to the user if any data violation occurs.

The alphabets that the program can process are: DNA: a,c,g,t with complements a-to-t, t-to-a, c-to-g, g-to-c, and RNA: a,c,g,u with complements a-to-u, c-to-g, g-to-c, g-to-u, u-to-a, u-to-g. Radio buttons are used to make the selection of the alphabet (default is DNA). The program will accept searches for RNA structures using an alphabet of a,c,g,t but will translate t to u and then use the RNA complementation. Output in this case uses the a,c,g,u alphabet.

Structures are of the form

|---------|------|---------|
     A        B       C

where A and C are correlated regions of equal length and B is a `spacer'. Types of target structures that can be searched for are: Stem loop (default), Repeat, Inverse, with selection by radio button. These structures are based on the following correlations:

identity, giving repeat structures,
reverse, giving inverse structures (palindromes),
complement, giving stem loops when combined with reverse.

Figure 2: Input form

Correlated regions are always of equal length (measured in nucleotides), which may be specified within a minimum and maximum range, and within a percentage mismatch range based on Hamming distance. The length of the spacer region, measured in nucleotides, may similarly be specified within a minimum and maximum range, as can the total length of the structure. This potential redundancy in information gives the user the freedom to omit some of the parameters as he sees fit. The positions on the input string where the structure is to be searched for may be specified, so that the structure starts and ends either at exact positions or within that range (i.e. as a window).

The input string itself can be given with one or more lines, each optionally prefaced by an integer indicating the start position. If the first line is prefaced by an integer, then that value is taken to be the initial start position of the string; otherwise the default is that the string starts at position 1. Spaces are ignored in the input string.

Output comprises a repetition of the query parameters and the input string, followed by those structures found (if any) in the form:

Mismatch percent start=position, end=position
start=position, end=position, correlated region (1)
start=position, end=position, spacer region
start=position, end=position, correlated region (2)

For example, the structures found for the query described in Figure 2 are:

Stemloop

Mismatch 0% start=105, end=138
start=105, end=109, gtatt
start=110, end=133, ttctcagtctaatttttgcgtatt
start=134, end=138, aatac

Mismatch 0% start=169, end=204
start=169, end=173, aaaaa
start=174, end=199, actaataattttatgaaattaaataa
start=200, end=204, ttttt


aaaaa ctaataattttatgaaattaaataat ttttt


taacc aaaactaggtaatttatccggtcaaa ggtta


aaaaaa ctaataattttatgaaattaaataa tttttt


taaatta gtctccaaaattaaccaaaactagg taattta

Using the PiLLoW library

We use the PiLLoW library [10] to access the data sent by the client via the form; specifically we use get_form_input(InputList), which translates input from the form to a dictionary (Dic below) of attribute=value pairs. It translates empty values (which indicate only the presence of an attribute) to '$empty', values with more than one line (from text areas or files) to a list of lines as strings, and the rest to atoms or numbers. The get_form_value(Dic,Var,Val) routine is then used to get the value of the attribute Var into Val, and we also employ the text_lines(Val,Lines) routine, which transforms a value given by a dictionary to a list of lines, for data coming from a text area.
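For readers unfamiliar with CGI, the combined effect of the POST method and of the dictionary construction described above can be sketched in Python. This is an analogue for illustration, not PiLLoW itself: the function read_form_input and the form field names are hypothetical, only integers are treated as numbers, and the standard CONTENT_LENGTH variable is used to decide how many bytes to read from standard input.

```python
import io
from urllib.parse import parse_qsl


def read_form_input(environ, stdin):
    """Read a POSTed form body and return a dictionary of attribute=value pairs."""
    # POST gives no end-of-data marker, so read exactly CONTENT_LENGTH bytes.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stdin.read(length).decode("utf-8")
    dic = {}
    for name, value in parse_qsl(body, keep_blank_values=True):
        if value == "":
            dic[name] = "$empty"              # attribute present but empty
        elif "\n" in value:
            dic[name] = value.splitlines()    # text area: a list of lines
        else:
            try:
                dic[name] = int(value)        # numeric field
            except ValueError:
                dic[name] = value             # anything else stays text
    return dic


if __name__ == "__main__":
    # Hypothetical field names; a real CGI program would pass os.environ
    # and sys.stdin.buffer instead of these stand-ins.
    body = b"alphabet=dna&minstem=4&maxloop=&sequence=acgt%0D%0Aggca"
    environ = {"CONTENT_LENGTH": str(len(body))}
    print(read_form_input(environ, io.BytesIO(body)))
```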


Porting the PiLLoW library to clp(FD)

The PiLLoW library has been well-designed to make the task of interfacing logic programs to the Web easy. Our application required only a (significant) subset of this library: the routines associated with accessing the data sent to a CGI program by a client program. However, the library had not been ported to clp(FD), and this we had to do.

The first problem in making the port was that the PiLLoW libraries make extensive use of definitions written in DCGs; clp(FD) does not have DCG expanders written into its clause reader. We got around this by reading in the libraries using SICStus Prolog, and then listing the consulted programs to file; an unwelcome side-effect was that the comments and meaningful variable names were lost, as well as the code growing in size.

Secondly, the code in the original PiLLoW libraries does not completely conform to the ISO Prolog standard [2, 12], whereas clp(FD) is compliant. Specifically, we changed all occurrences of atom_chars/2 and number_chars/2 in the original code to atom_codes/2 and number_codes/2 respectively. The environment is accessed in a cleaner way in clp(FD) than in the original libraries, and thus we were able to use the following routine:

getenvstr(Name,Content):-
    unix(getenv(Name,String)),
    atom_codes(String,Content).

as opposed to the original

getenvstr(Var,Val) :-
    name(Var,VS),
    append("echo $",VS,SCommand),
    name(Command,SCommand),
    unix(popen(Command,read,S)),
    get_line(S,Val),
    Val = [_|_].

Finally, the PiLLoW system and clp(FD) are both module-based but, as one might expect, employ different syntax, forcing us to modify the code accordingly. In summary, the PiLLoW libraries required few modifications to make the port to clp(FD), although our task would have been easier if we had had a manual or more complete documentation for the system.
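The DCG issue can be illustrated with a small hand-written example. The grammar below is not taken from PiLLoW: it is a hypothetical digit-sequence parser, shown first in DCG notation (as a comment) and then in the plain-clause form that a system without a DCG expander can load. The code a particular expander emits may differ in detail, for instance in how terminals are translated. The sketch also uses atom_codes/2 and number_codes/2, the ISO predicates mentioned above.

```prolog
% DCG notation, which needs an expander in the clause reader:
%
%   digits([D|Ds]) --> digit(D), digits(Ds).
%   digits([D])    --> digit(D).
%   digit(D)       --> [D], { D >= 48, D =< 57 }.
%
% Hand-expanded equivalent: every nonterminal gains two extra
% arguments, the input list before and after the phrase it recognises.
digits([D|Ds], S0, S) :-
    digit(D, S0, S1),
    digits(Ds, S1, S).
digits([D], S0, S) :-
    digit(D, S0, S).

% A terminal consumes one character code; 48..57 are the codes '0'..'9'.
digit(D, [D|S], S) :-
    D >= 48,
    D =< 57.

% parse_number(+Atom, -Number): hypothetical helper that succeeds only
% if Atom consists entirely of decimal digits.
parse_number(Atom, Number) :-
    atom_codes(Atom, Codes),       % ISO: atom -> list of character codes
    digits(Ds, Codes, []),         % the whole input must be consumed
    number_codes(Number, Ds).      % ISO: list of codes -> number
```

For example, `parse_number('1997', N)` succeeds with N bound to the integer 1997, while `parse_number(abc, N)` fails. The expanded clauses are longer and lose the grammar-like reading of the original rules, which is the same effect, on a small scale, as the growth in code size noted above.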

Good software engineering practice

Two variants of our system exist: one with a Web interface and the other with a simple teletype interface; both utilise the same search engine. We have made use of the module facility of clp(FD) to ensure that both variants use the same version of the search engine: the search engine code is kept separate from the teletype-interface and web-interface routines, allowing the engine to be linked to either interface. Moreover, we use RCS, the Revision Control System, to control the generation of versions and to be able to back out of a version if it proves to be flawed. We have taken advantage of the ability of RCS to number versions automatically and to insert this information in the source code, thus enabling users to tell us which version of the program they are using when they give feedback.

6 Testing

Our search engine source program is 388 lines (10K) of clp(FD) code; we have compiled our program to 370K of stand-alone sun-sparc code using the clp(FD) system [13], and have used this to test the detection of stem-loops from a variety of databases, including the entry with ID CXSTPLUC2 (accession number X87994) from the EMBL nucleotide sequence database release 49 (Nov 1996), URL: http://www2.no.embnet.org/srs/srsc?[EMBL-id:CXSTPLUC2]+-sf+GCG. For example, our program took 40 ms on a Sun IPX to find the stem-loop cccgtcca, gctcggct, tggacggg at position 20-43 (perfect matching), and 90 ms to find the stem-loop cagctcg, gcttgga, cgggctg at position 26-46 (mismatch of 14%) in a string of nucleotides from positions 1-60. More complete test results can be found in [14]. Readers can access the Web version of our program at http://www.soi.city.ac.uk/~drg/cgi-bin/struct-form.html

We should emphasise that our intention in constructing our system is not to challenge existing and well-established programs, but rather to demonstrate that a declarative language can give concise descriptions of structures based on general high-level specifications.
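The two stem-loops reported above can be checked by hand with a short sketch in plain Prolog. This is not the search engine itself, and it uses no finite domain constraints: it only tests whether the right arm of a stem is the reverse complement of the left arm and counts the positions that disagree. Counting mismatches position by position (a Hamming distance) is our assumption about how the mismatch percentage is measured, and all predicate names are hypothetical.

```prolog
% complement(?Base, ?Partner): Watson-Crick pairing for DNA.
complement(a, t).
complement(t, a).
complement(c, g).
complement(g, c).

% rev(+List, -Reversed): list reversal with an accumulator.
rev(Xs, Ys) :- rev_acc(Xs, [], Ys).

rev_acc([], Acc, Acc).
rev_acc([X|Xs], Acc, Ys) :- rev_acc(Xs, [X|Acc], Ys).

% complement_all(+Bases, -Partners): complement each base in turn.
complement_all([], []).
complement_all([B|Bs], [C|Cs]) :-
    complement(B, C),
    complement_all(Bs, Cs).

% mismatches(+Xs, +Ys, -N): number of positions at which two lists of
% equal length differ; fails if the lengths differ.
mismatches([], [], 0).
mismatches([X|Xs], [Y|Ys], N) :-
    mismatches(Xs, Ys, N0),
    ( X == Y -> N = N0 ; N is N0 + 1 ).

% stem_mismatches(+Left, +Right, -N, -Len): Left and Right are atoms
% naming the two arms of a stem of length Len. N is the number of
% positions at which Right differs from the reverse complement of Left.
stem_mismatches(Left, Right, N, Len) :-
    atom_chars(Left, Ls),          % ISO: atom -> list of one-char atoms
    atom_chars(Right, Rs),
    rev(Ls, Reversed),
    complement_all(Reversed, Expected),
    mismatches(Expected, Rs, N),
    length(Ls, Len).
```

The query `stem_mismatches(cccgtcca, tggacggg, N, Len)` gives N = 0 and Len = 8, the perfect match. The query `stem_mismatches(cagctcg, cgggctg, N, Len)` gives N = 1 and Len = 7, and one mismatch in seven positions is about 14%, in agreement with the figure quoted above.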

7 Other approaches

Several programs have been developed for searching sequences for the presence of patterns; they can be divided into two types, special purpose and general purpose. Special purpose programs are designed to search for specific patterns, e.g. candidates for trans-splicing sites [11]; several use the minimal free energy principle, e.g. Zuker [30], and stability measures. General purpose programs usually include a language in which the user specifies the sequential and structural components of the pattern to be searched for. Examples of languages which can only search for sequential patterns are QUEST [3], SCRUTINEER [26], ANREP [21] as well as PROSITE [5], which uses a declarative notation. Languages which can search for structural patterns include Staden's language [28], OVERSEER [27], PALM [18], and GENLANG [25]. Palingol [7] is a constraint programming language whose data types and search engines have been adapted for secondary structures, and which is implemented directly in C; constraints are boolean expressions. We think that one of the advantages of our approach over that of Palingol is that our language is a constraint logic programming language and is implemented in clp(FD), giving it a more rigorous basis.

8 Summary

We have described an implementation in the finite domain constraint logic programming language clp(FD) of a web-based biosequence structure search engine. The engine is based on a declarative language with constraints over distances (between strings in terms of nucleotides) and relations over a finite alphabet of nucleotides. Users specify the parameters of the structure which they wish to search for, and also provide an input string over which the search is to be carried out. The search engine constructs a schema, or generalised structure, which is then matched against the input string, and the instances found are returned. These structures range from strings and regular expressions to more complex structures such as palindromes, repeats, stem loops and pseudo-knots. The search engine uses a naive backtracking algorithm for matching, but despite its inefficiencies we have tested our implementation on some real biological sequences with encouraging results. We are now in the process of constructing an efficient and specialised CSP-based solver which we intend to interface to the implementation in constraint logic programming.

Limitations of the present implementation include the relatively small datasets which can be efficiently handled using the POST method; we plan that in the future users will be able to supply the URL of a biosequence database and that our implementation will then retrieve the input string itself using PiLLoW routines. A challenging task for the future will be to extend our program to search for structures in protein databases and to interface it to a plug-in visualiser such as Chime [1].


Acknowledgements

We wish to thank Daniel Diaz, author of the clp(FD) package, for his help with designing some of the routines needed by our solver, and Manuel Hermenegildo and the other authors of the excellent PiLLoW package. This work has been carried out as part of a project financed by the British Council and the Norwegian Research Council, which provided funding for the research visits. In addition, Inge Jonassen's research post is financed by the Norwegian Research Council.

References

[1] Chemscape chime 1.0. http://www.mdli.com/chemscape/chime/. Netscape Navigator plug-in.
[2] ISO/IEC 13211-1, Information Technology - Programming Languages - Prolog - Part 1: General Core, 1995.
[3] R. M. Abarbanel, P. R. Eiencke, E. Mansfield, D. A. Jaffe, and D. L. Brutlag. Rapid searches for complex patterns in biological molecules. Nucleic Acids Research, 12(1):263-280, 1984.
[4] A. Bairoch, P. Bucher, and K. Hofman. The PROSITE database, its status in 1995. Nucleic Acids Research, 24(1):189-196, 1996.
[5] A. Bairoch, P. Bucher, and K. Hofman. The PROSITE database, its status in 1995. Nucleic Acids Research, 24:189-196, 1996.
[6] L. Baranyi, W. Campell, K. Ohshima, S. Fujimoto, M. Boros, and H. Okada. The antisense homology box: A new motif within proteins that encodes biologically active peptides. Nature Medicine, 1(9):894-901, 1995.
[7] Bernard Billoud, Milutin Kontic, and Alain Viari. Palingol: a declarative language to describe nucleic acids' secondary structures and to scan sequence databases. Nucleic Acids Research, 24(8):1395-1403, 1996.
[8] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. Technical Report TCU/CS/1995/18, Department of Computer Science, City University, 1995. Also Technical Report 113, Department of Informatics, University of Bergen, Bergen, Norway. Accepted for publication in the Journal of Computational Biology.
[9] S. E. Brenner and E. Aoki. Introduction to CGI/PERL. M&T Books, 1996.
[10] D. Cabeza, M. Hermenegildo, and S. Varma. The PiLLoW/CIAO Library for INTERNET/WWW Programming using Computational Logic Systems. In Proceedings of the 1st Workshop on Logic Programming Tools for INTERNET Applications, pages 72-90, JICSLP'96, Bonn, September 1996. Text and code available from http://www.clip.dia.fi.upm.es/miscdocs/pillow/pillow.html.
[11] T. Dandekar and P. R. Sibbald. Trans-splicing of pre-mRNA is predicted to occur in a wide range of organisms including vertebrates. Nucleic Acids Research, 18(16):4719-4725, 1990.
[12] P. Deransart, A. Ed-Dbali, and L. Cervoni. Prolog: The Standard. Springer, 1996.
[13] D. Diaz and P. Codognet. A Minimal Extension of the WAM for clp(FD). In David S. Warren, editor, Proceedings of the Tenth International Conference on Logic Programming, pages 774-790, Budapest, Hungary, 1993. The MIT Press.
[14] I. Eidhammer, D. Gilbert, I. Jonassen, and M. Ratnayake. A constraint based structure description language for biosequences. Technical report 1997/04, Department of Computer Science, City University, UK and Department of Informatics, University of Bergen, Norway, 1997.
[15] Ingvar Eidhammer, David Gilbert, Inge Jonassen, and Madu Ratnayake. A constraint based structure description language for biosequences. Submitted to CP97, 1997.
[16] C. Gervet. Conjunto: constraint logic programming with finite set domains. In Maurice Bruynooghe, editor, Logic Programming - Proceedings of the 1994 International Symposium, pages 339-358, Massachusetts Institute of Technology, 1994. The MIT Press.
[17] R. Hamming. Coding and Information Theory. Prentice Hall, Englewood Cliffs, NJ, 1982.
[18] C. Helgesen and P. Sibbald. PALM - a pattern language for molecular biology. In L. Hunter, D. Searls, and J. Shavlik, editors, Proceedings First International Conference on Intelligent Systems for Molecular Biology, pages 172-180. AAAI Press, 1993.
[19] P. V. Hentenryck and Y. Deville. Operational semantics of constraint logic programming over finite domains. In J. Maluszynski and M. Wirsing, editors, PLILP91, number 528 in LNCS, pages 395-406. Springer-Verlag, August 1991.
[20] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii nauk SSSR (in Russian), 163(4):845-848, 1965. Also in Cybernetics and Control Theory, vol 10, no. 8, pp 707-710, 1996.
[21] G. Mehldau and G. Myers. A system for pattern matching applications on biosequences. CABIOS, 9(3):299-314, 1993.
[22] A. Rajasekar. Applications in constraint logic programming with strings. In Alan Borning, editor, PPCP'94: Second Workshop on Principles and Practice of Constraint Programming, Seattle WA, May 1994.
[23] Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjoelander, R. Underwood, and D. Haussler. Stochastic context-free grammars for tRNA modelling. Nucleic Acids Res, 22:5112-5120, 1994.
[24] D. Searls. The Computational Linguistics of Biological Sequences. Tutorial at Third International Conference on Intelligent Systems for Molecular Biology, 1995.
[25] D. B. Searls and S. Dong. A syntactic pattern recognition system for DNA sequences. In C. R. Cantor, H. A. Lim, J. Fickett, and R. J. Robbins, editors, Proceedings Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis, pages 89-101. World Scientific, 1993.
[26] P. R. Sibbald and P. Argos. Scrutineer: a computer program that flexibly seeks and describes motifs and profiles in protein sequences databases. CABIOS, 6(3):279-288, 1990.
[27] P. R. Sibbald, H. Sommerfeldt, and P. Argos. Overseer: a nucleotide sequence searching tool. CABIOS, 8(1):45-48, 1992.
[28] R. Staden. Searching for Patterns in Protein and Nucleic Acid Sequences. In R. F. Doolittle, editor, Methods in Enzymology, Vol. 183, pages 193-211. Academic Press, 1990.
[29] C. Walinsky. CLP(Σ*): Constraint logic programming with regular sets. In Giorgio Levi and Maurizio Martelli, editors, ICLP'89: Proceedings 6th International Conference on Logic Programming, pages 181-196, Lisbon, Portugal, June 1989. MIT Press.
[30] Michael Zuker. On Finding All Foldings of an RNA Molecule. Science, 244:48-52, 1989.