Disjunctive Logic Programming

Flexible Pattern Discovery with (Extended) Disjunctive Logic Programming

Luigi Palopoli1, Simona Rombo2 and Giorgio Terracina3

1 DEIS - Università della Calabria, Via Pietro Bucci, 87036 Rende (CS), Italy
2 DIMET, Università “Mediterranea” di Reggio Calabria, Via Graziella, Località Feo di Vito, 89060 Reggio Calabria, Italy
3 Dipartimento di Matematica, Università della Calabria, Via Pietro Bucci, 87036 Rende (CS), Italy
[email protected], [email protected], [email protected]

Abstract. The post-genomic era has raised a wide range of new challenging issues for the areas of knowledge discovery and intelligent information management. Among them, the discovery of complex pattern repetitions in string databases plays an important role, specifically in those contexts where even the pattern classes to be considered interesting are unknown. This paper contributes to this precise setting by proposing a novel approach, based on disjunctive logic programming extended with several advanced features, for discovering interesting pattern classes from a given data set.

1 Introduction

In the last few years, a particular class of raw data stored in string databases, namely genomic data, has assumed a prominent role. The completion of the human genome sequencing has raised a wide range of new challenging issues involving sequence analysis. Genome databases mainly consist of sets of strings representing DNA or protein sequences (biosequences), and most of these strings still need to be “interpreted”. In this context, discovering common patterns in sets of biologically related sequences (i.e., motifs) is important because the presence of conserved regions provides insights into the biological function played by the corresponding macromolecules, e.g., for identifying gene regulatory processes [1, 5, 15]. Recently, a particular class of motifs, called structured motifs [16, 15], has received much attention, since it has been observed [7] that the relative positions of motif regions (we call these regions boxes in the following) are characterized when they participate in biological processes. For instance, the most frequently observed prokaryotic promoter regions are in general composed of two boxes positioned approximately 10 and 35 bases upstream from the transcription start. Moreover, since mutations are quite common and often important for evolution, it is mandatory to allow mismatches among boxes in the analysis of their similarity. Clearly, as the complexity of the considered organisms increases, the complexity of potentially interesting motifs increases as well [1]; e.g., it is interesting to consider motifs with variable length boxes, or with unconstrained distances between boxes, and so on.

In the literature, several algorithms for pattern discovery and several classes of patterns have been considered. As an example, [2, 4, 9, 10, 12] deal with simple (i.e., unstructured) motifs; other approaches consider structured motifs [8, 11, 17] but do not allow mismatches; others identify structured motifs allowing also mismatches [6, 15, 18] (see footnote 4). All those approaches are tailored to specific classes of patterns and, as with most algorithms, even slight changes in the pattern class to be dealt with may significantly reduce their effectiveness. Generally speaking, very efficient and effective algorithms are available when the class of patterns of interest is quite well defined (e.g., for prokaryotic promoter regions), but when the class of interest is unknown (e.g., in eukaryotes) the problem shifts away from motif extraction to the selection of the right approach to apply.

This paper is concerned with the definition and the implementation of a framework for defining and solving under-specified motif extraction problems where, for instance, the number and the length of boxes can be variable. Our framework (i) is general in that it covers a wide range of pattern classes; (ii) produces results that can be exploited to guide the selection of specific, efficient algorithms tailored to the resulting pattern classes; (iii) can be exploited as a “fast prototyping” approach to quickly verify the relevance of new pattern classes in specific biological domains. Notice that, given the complexity of the problem at hand, our approach is not expected to be particularly efficient in solving the motif discovery problem. Rather, it is mainly oriented to verifying the existence of some kinds of interesting motif classes in available data.
Our framework is based on automatically generating logic programs starting from user-defined under-specified extraction problems for locating various kinds of motifs in a set of sequences. In this setting, we exploit disjunctive logic programming extended with a variety of features to obtain a sufficiently high language expressiveness allowing for dealing with a large variety of pattern classes. To the best of our knowledge, this is the first attempt in this direction. We have developed a prototype implementing the proposed approach; in particular we exploited DLV [13] as the core inference engine and Java to equip the system with an easy, user-friendly interface.

2 Problem statement and supported pattern classes

In this section we illustrate the pattern classes currently supported in our framework. As pointed out in the Introduction, the main purpose of our approach is the identification of significant patterns frequently occurring in a set of input sequences; generally speaking, we look for portions of the input sufficiently similar to a number of other portions under some user-defined specifications. It is possible to identify two main categories of specifications, namely, structural specifications and affinity specifications.

4 A classification of the approaches for motif discovery is presented in [3].

Fig. 1. Example of structured patterns

Structural specifications refer to the structure the patterns must have in order to be considered interesting. In its simplest form, a pattern is composed of l contiguous symbols which are all interesting and must be taken into account in pattern repetitions. However, patterns of this kind are of little practical relevance. Indeed, it has been observed that biologically relevant patterns are often constituted by two or more relevant regions (we call these regions boxes in the rest of the paper) separated by a number of irrelevant symbols. Figure 1 graphically shows this concept. It indicates that the pattern constituted by the box ‘AAA’, followed by a number of irrelevant symbols, followed by the box ‘TATT’, is repeated in two of the three input sequences. In this setting, the following structural specifications can be defined:

Box number, denoting the number of distinct relevant regions the pattern must have in order to be considered interesting.

Box length, indicating the number of symbols in each box; in particular, different boxes may have different lengths and each box length may vary within a given interval.

Distance between boxes, indicating the number of irrelevant symbols separating two consecutive boxes; the framework allows both different distances for different box pairs and intervals of variation for each distance.

In several application contexts, it is useful to fix the content of some boxes to denote “anchors” in the pattern discovery process; as a consequence, the framework allows for the definition of the following specification:

Anchor box, which imposes that a pattern must contain precisely a certain given content at a certain given position in order to be considered interesting.
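As a rough illustration of how box number, box length and distance between boxes jointly constrain the candidate patterns of a sequence, the following Python sketch enumerates all patterns satisfying a given structural specification. It is only a sketch under stated assumptions: the function name `enumerate_patterns`, the interval encoding as `(min, max)` pairs and the example sequence are hypothetical, anchor boxes are not handled, and the framework itself derives patterns through generated logic programs rather than through code of this kind.

```python
def enumerate_patterns(seq, box_lengths, distances):
    """Yield every pattern of `seq` that satisfies a structural specification.

    box_lengths: one (lmin, lmax) pair per box (hypothetical encoding).
    distances:   one (dmin, dmax) pair per pair of consecutive boxes.
    Each pattern is a list of (start, content) boxes, with 1-based starts.
    """
    def extend(idx, start):
        # `start` is the 0-based position where box number idx begins.
        lmin, lmax = box_lengths[idx]
        for length in range(lmin, lmax + 1):
            end = start + length
            if end > len(seq):
                break  # longer boxes would not fit either
            box = (start + 1, seq[start:end])
            if idx == len(box_lengths) - 1:
                yield [box]
            else:
                dmin, dmax = distances[idx]
                for gap in range(dmin, dmax + 1):
                    # Skip `gap` irrelevant symbols, then place the next box.
                    for rest in extend(idx + 1, end + gap):
                        yield [box] + rest

    for first_start in range(len(seq)):
        yield from extend(0, first_start)


if __name__ == "__main__":
    # Two boxes of lengths 3 and 4 separated by exactly 2 irrelevant symbols.
    for pattern in enumerate_patterns("CAAAGGTATTC", [(3, 3), (4, 4)], [(2, 2)]):
        print(pattern)
```

With fixed lengths and distances the enumeration is linear in the sequence length; widening the intervals multiplies the number of candidates, which is one reason why restricting the search space matters in this setting.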
In the following, for the sake of clarity, we shall use the notation $b_{l_1} X(d_1) b_{l_2} X(d_2) \ldots X(d_{r-1}) b_{l_r}$ for representing pattern structure specifications; here, each $b_{l_i}$ indicates a box of length $l_i$, each $X(d_j)$ indicates the sequence of $d_j$ irrelevant symbols separating $b_{l_j}$ from $b_{l_{j+1}}$, and $r$ is the number of boxes. Each of $l_i$, $d_j$ and $r$ can be an interval. We say that the box $b_{l_i}$ is at position $i$ within the pattern specification.

While structural specifications determine the characteristics of each pattern, affinity specifications fix the relationships that must hold between two patterns (or portions thereof) in order for them to be considered similar. We distinguish between two kinds of affinities: (i) single box affinity and (ii) whole pattern affinity. Single box affinity tells when two boxes of two patterns can be considered similar. Let $s_1$ and $s_2$ be two strings. The Hamming Distance H between them is the minimum number of symbol substitutions to be applied on $s_1$ to obtain $s_2$, whereas the Levenshtein Distance L between them is the minimum number of edit operations (i.e., symbol substitutions, insertions and deletions) to be applied on $s_1$ to obtain $s_2$. As an example, ‘AAGT’ and ‘ACGA’ are at Hamming distance H = 2, whereas the Levenshtein distance between ‘misspell’ and ‘mistell’ is L = 2. The single box affinities we consider are the following:

Exact match, which implies that two boxes must be exactly equal in order to be considered similar.

Hamming distance, for which two boxes are considered similar if their Hamming distance is less than a given threshold.

Levenshtein distance, for which two boxes are considered similar if their Levenshtein distance is less than a given threshold.

As far as whole pattern affinity is concerned, if no specifications are provided, it is assumed that two patterns $p_1$ and $p_2$ are similar if each pair of boxes $b'_{l_i}$ of $p_1$ and $b''_{l_i}$ of $p_2$ (i.e., boxes in the same positions within their patterns) are similar. In this case, we say that a basic similarity holds between $p_1$ and $p_2$. Otherwise, the following whole pattern affinity specifications can be defined:

Allow Skips: two patterns $p_1$ and $p_2$ are considered similar even if a certain number of boxes of $p_1$ are not similar to the corresponding ones of $p_2$; the user can specify the maximum allowed number of skips.

Allow Changing Box Order: two patterns $p_1$ and $p_2$ are considered similar even if a certain number of box order changes occur; a box order change occurs if the relative positions of two consecutive boxes in $p_1$ must be exchanged in order to obtain a basic similarity between $p_1$ and $p_2$.

Allow Inverse Box: two patterns $p_1$ and $p_2$ are considered similar even if the content of some (or all) boxes of $p_1$ must be inverted to obtain a basic similarity between $p_1$ and $p_2$.

Figure 2 graphically shows the concepts presented above.

Fig. 2. Examples of whole pattern affinity specifications

We are now ready to state the problem addressed in this paper.

Problem Statement.
Given a collection of input sequences SC, a set of specifications and an integer Q, find all motifs in SC, that is, all patterns satisfying the structural specifications and similar (w.r.t. the provided box affinity and whole pattern affinity specifications) to at least Q other patterns, each obtained from a distinct sequence of SC. Q is called quorum and indicates the minimum support a pattern must have within SC.

It is worth pointing out that most of the existing pattern discovery approaches consider neither the Levenshtein distance for box affinity nor the whole pattern affinity specifications introduced in this section. Finally, note that, since the allowed affinity specifications are such that pattern similarity is not transitive, it may happen that two patterns are similar but only one of them is also a motif.
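The two distances used for single box affinity can be sketched in a few lines of Python. This is a minimal illustration, not the way the framework computes them; the function names `hamming` and `levenshtein` are chosen here for convenience. The examples reproduce the two values given in this section.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of substitutions, insertions and deletions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s1
                curr[j - 1] + 1,         # insert b into s1
                prev[j - 1] + (a != b),  # substitute a with b (free if equal)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(hamming("AAGT", "ACGA"))             # 2
    print(levenshtein("misspell", "mistell"))  # 2
```

Unlike the Hamming distance, the Levenshtein distance is defined for boxes of different lengths, which matters when box lengths are allowed to vary within an interval.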

3 Automatic generation of logic programs

3.1 Preliminaries

As pointed out in the Introduction, we exploit the DLV system [13] as the core inference engine to implement our pattern discovery system, as it provides several advanced features such as (i) disjunctive rules, (ii) constraints, (iii) arithmetic predicates, (iv) aggregate predicates and (v) external functions. We assume the reader is familiar with (disjunctive) logic programming (see [14] for a good source of background material). Some basic issues of the DLV language are recalled next [13]. In DLV, a disjunctive rule a ∨ b :- c indicates that if c is true, then either a or b must be true. Constraints of the form :- r state that condition r must be false in all models. Arithmetic predicates #int, #succ, +, * allow operations to be performed on integer valued variables. Aggregate predicates can be computed over sets of elements. They can occur in the bodies of rules and constraints, possibly negated by negation-as-failure. As an example, the rule q(X) :- a(X), #count{Z : b(X,Z,k)} < 3 states that q(X) is true for each X such that a(X) is true and the number of times that b(X,Z,k) is true is less than 3. Finally, external functions are linked dynamically to the logic programs and are executed as standard imperative programs. We have tried to obtain a good trade-off between declarativeness and efficiency by limiting the exploitation of external functions.

3.2 Program generation

A first, important issue to consider is how to represent the patterns satisfying structural specifications that can be derived from the input sequences. Indeed, the formalization of the whole approach depends on this representation. We assume a pattern is represented as a list of ordered boxes; each box is identified by the predicate box(B,IDS,Pos,Idx,L), where B represents the box content, IDS is the identifier of the sequence B has been obtained from, Pos is the starting position of B within IDS, whereas Idx specifies the position (index) of B within the pattern; finally, L is the number of symbols in B. This last piece of information is maintained only for efficiency reasons, since it could in any case be derived from B. Before describing how the set of boxes is obtained by the program, it is necessary to introduce the set of facts automatically generated from the user inputs, namely:

(1) string(IDS,"sequence")
(2) boxNumber(R)
(3) quorum(Q)
(4) boxLength(Idx,Lmin,Lmax)
(5) boxDistance(Idx1,Idx2,Dmin,Dmax)
(6) error(Idx,Emax)
(7) skipNumber(Nskips)
(8) minStartPosition(Idx,PosMin,IDS)
(9) maxStartPosition(Idx,PosMax,IDS)
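To make the facts about starting positions more concrete, the following Python sketch derives them from minimum box lengths, minimum distances and the length of one input sequence. It is an assumption-laden illustration: the paper only states that these facts are obtained from the structural specifications and the input sequences, so the formulas below (shortest boxes and gaps before a box for the minimum, shortest remaining pattern still fitting in the sequence for the maximum), the function names and the textual fact syntax are all hypothetical. Positions are 1-based.

```python
def min_start_positions(lmin, dmin):
    """Earliest 1-based start of each box, assuming the shortest boxes and gaps before it."""
    positions = []
    pos = 1
    for idx, length in enumerate(lmin):
        positions.append(pos)
        if idx < len(dmin):
            pos += length + dmin[idx]
    return positions


def max_start_positions(lmin, dmin, seq_len):
    """Latest 1-based start of each box such that the rest of the pattern,
    taken at its minimum size, still fits in the sequence (an assumption)."""
    positions = []
    for idx in range(len(lmin)):
        tail = sum(lmin[idx:]) + sum(dmin[idx:])
        positions.append(seq_len - tail + 1)
    return positions


def position_facts(ids, seq, lmin, dmin):
    """Render facts of types (8) and (9) as text for one sequence."""
    facts = []
    mins = min_start_positions(lmin, dmin)
    maxs = max_start_positions(lmin, dmin, len(seq))
    for idx, (lo, hi) in enumerate(zip(mins, maxs), start=1):
        facts.append(f"minStartPosition({idx},{lo},{ids}).")
        facts.append(f"maxStartPosition({idx},{hi},{ids}).")
    return facts


if __name__ == "__main__":
    # Two boxes of minimum lengths 3 and 4, at least 5 irrelevant symbols apart.
    for fact in position_facts("s1", "A" * 20, [3, 4], [5]):
        print(fact)
```

For a first box of length 3 followed by 5 irrelevant symbols, the second box cannot start before position 9, since positions 1 to 8 are taken by the first box and the gap.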

There is a fact of type (1) for each input sequence; each sequence is associated with an identifier IDS. Facts (2) and (3) tell the number of boxes allowed for each pattern and the quorum, respectively. Facts of type (4) indicate the minimum and the maximum length each box might have; here, Idx indicates the position of the box within the pattern and varies between 1 and R. Analogously, facts of type (5) specify the allowed number of irrelevant symbols separating consecutive boxes. Facts of type (6) tell, for each box, the maximum allowed Hamming or Levenshtein distance for box affinity (it is set to 0 if the selected box affinity is exact match), whereas fact (7) encodes the maximum number of skips allowed in the whole pattern affinity specification. Finally, facts of type (8) (resp., (9)) are obtained from both the structural specifications and the input sequences and state, for each box, its minimum (resp., maximum) allowed starting position within the input sequence IDS; this allows the search space to be reduced. As an example, for patterns of the form $b_3 X(5) b_4$ the minimum starting position for a second box in a sequence is 9. It is now possible to describe how boxes are represented:

(10) box(B,IDS,Pos,Idx,L) :- string(IDS,Str), #int(L), Lmin