Computing Display Conflicts in String Visualization
Dinesh P. Mehta†‡    Sartaj Sahni†
Technical Report 24
Abstract Strings are used to represent a variety of objects such as DNA sequences, text, and numerical sequences. A model for a system for the visualization and analysis of strings was proposed in [1]. In this paper, we present algorithms which implement some of the queries supported by this model.
Keywords and Phrases:
Strings, visualization, analysis, directed acyclic word graphs.
This research was supported in part by the National Science Foundation under grant MIP 86-17374.
† Dept. of Computer and Information Sciences, University of Florida, Gainesville, FL 32611
‡ Dept. of Computer Science, University of Minnesota, Minneapolis, MN 55455
[Figure 1: Highlighting displayable entities (S = abczdefydefxabc)]
1 Introduction
The string data type is used to represent a number of objects such as text strings, DNA or protein sequences in molecular biology, numerical sequences, etc. Research in molecular biology, text analysis, and interpretation of numerical data involves the identification of recurring patterns in data and hypothesizing about their causes and/or effects [2, 3]. Detecting patterns visually in long strings is tedious and prone to error. In [1], a model was proposed to alleviate this problem. The model consists of identifying all recurring patterns in a string and highlighting identical patterns in the same color.

We first discuss the notion of maximal patterns. Let abc be a pattern occurring m times in a string S. Let the only occurrences of ab be those which occur in abc. Then, the pattern ab is not maximal in S as it is always followed by c. The notion of maximality is motivated by the assumption that in most applications, longer patterns are more significant than shorter ones. Maximal patterns that occur at least twice are known as displayable entities. The problem of identifying all displayable entities and their occurrences in S can be solved from the results in [4].

Once all displayable entities and their occurrences are obtained, we are confronted with the problem of color coding them. In the string S = abczdefydefxabc, abc and def are the only displayable entities. So, S would be displayed by highlighting abc in one color and def in another, as shown in Figure 1.

In most strings, we encounter the problem of conflicts. Consider the string S = abcicdefcdegabchabcde and its displayable entities, abc and cde (both are maximal and occur thrice). They must be highlighted in different colors. Notice, however, that abc and cde both occur in the substring abcde, which occurs as a suffix of S. Clearly, both displayable entities cannot be highlighted in different colors in abcde as required by the model. This is a consequence of the fact that the letter c occurs in both displayable entities. This situation is known as a prefix-suffix conflict (because a prefix of one displayable entity is a suffix of the other). Note, also, that c is a displayable entity in S. Consequently, all occurrences of c must be highlighted in a color different from those used for abc and cde. But this is impossible as c is a subword of both abc and cde. This situation is referred to as a subword conflict. The problem of subword conflicts may be partially alleviated by employing more sophisticated display models, as in Figure 2.

[Figure 2: Alternative display model (S = abcicdefcdegabchabcde)]

Irrespective of the display model used, it is usually not possible to display all occurrences of all displayable entities. We are therefore forced into having to choose which ones to display. There are three ways of achieving this:

Interactive: The user selects occurrences interactively by using his/her judgement. Typically, this would be done by examining the occurrences which are involved in a conflict and choosing one that is the most meaningful.

Automatic: A numeric weight is assigned to each occurrence. The higher the weight, the greater the desirability of displaying the corresponding occurrence. Criteria that could be used in assigning weights to occurrences include: length, position, number of occurrences of the pattern, semantic value of the displayable entity, information on conflicts, etc. The information is then fed to a routine which selects a set of occurrences so that the sum of their weights is maximized (algorithms for these are discussed in [1]).

Semi-Automatic: In a practical environment, the most appropriate method would be a hybrid of the interactive and automatic approaches described above. The user could select some occurrences that he/she wants included in the final display. The selection of the remaining occurrences can then be performed by a routine which maximizes the display information.

All the methods described above require knowledge about the conflicts, either to choose which occurrences to display (interactive) or to assign weights to the occurrences (automatic). Automatic methods would require a list of all the conflicts, while interactive methods require information about conflicts local to a particular segment of the string. Since prefix-suffix and subword conflicts are handled differently by different display models, separate lists for each are required.

In this paper we identify a family of problems relating to the identification of conflicts at various levels of detail. Problems relating to statistical information about conflicts are also identified. Efficient algorithms for these problems are presented. All algorithms make use of the symmetric compact directed acyclic word graph (scdawg) data structure [4] and may be thought of as operations on, or traversals of, the scdawg. The scdawg, which is used to represent strings and sets of strings, evolved from other string data structures such as position trees, suffix trees, and directed acyclic word graphs [5, 6, 7, 8].

Section 2 contains preliminaries including definitions of displayable entities, conflicts, and scdawgs. Section 3 presents optimal algorithms to determine whether a string has conflicts and to compute subword and prefix-suffix conflicts in a string. Sections 4, 5, and 6 discuss related size restricted, pattern restricted, and statistical problems and show how to implement these by modifying the algorithms of Section 3. Finally, Section 7 presents experimental data on the run times of some of these algorithms.
2 Preliminaries
2.1 Definitions
Let S represent a string of length n, whose characters are chosen from a fixed alphabet, Σ, of constant size. A pattern in S is said to be maximal iff its occurrences are not all preceded by the same letter, nor all followed by the same letter. Consider the string S = abczdefydefxabc. Here, abc and def are the only maximal patterns. The occurrences of def are preceded by different letters (z and y) and followed by different letters (y and x). The occurrences of abc are not preceded by the same letter (the first occurrence does not have a predecessor) nor followed by the same letter. However, de is not maximal because all its occurrences in S are followed by f.

A pattern is said to be a displayable entity (or displayable) iff it is maximal and occurs more than once in S (all maximal patterns are displayable entities with the exception of S, which occurs once in itself).

(i) A subword conflict between two displayable entities, D_1 and D_2, in S exists iff D_1 is a substring of D_2.

(ii) A prefix-suffix conflict between two displayable entities, D_1 and D_2, in S exists iff there exist substrings S_p, S_m, S_s in S such that S_p S_m S_s occurs in S, S_p S_m = D_1, and S_m S_s = D_2. The string S_m is known as the intersection of the conflict; the conflict is said to occur between D_1 and D_2 with respect to S_m.
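The definitions above can be checked directly on small strings. The following is a minimal brute-force sketch, assuming Python; the function names (`occurrences`, `is_maximal`, `displayable_entities`, `subword_conflicts`) are hypothetical and the approach enumerates every substring, so it illustrates the definitions only and is not the scdawg-based method used in this paper.

```python
def occurrences(s, p):
    """0-based start positions of all (possibly overlapping) occurrences of p in s."""
    return [i for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]

def is_maximal(s, p):
    """True iff the occurrences of p are neither all preceded nor all followed by one letter."""
    starts = occurrences(s, p)
    # None stands for "no neighbouring letter" at either end of s.
    left = {s[i - 1] if i > 0 else None for i in starts}
    right = {s[i + len(p)] if len(s) > i + len(p) else None for i in starts}
    varied = lambda ctx: None in ctx or len(ctx) > 1
    return varied(left) and varied(right)

def displayable_entities(s):
    """Maximal patterns that occur more than once in s."""
    subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return {p for p in subs if len(occurrences(s, p)) > 1 and is_maximal(s, p)}

def subword_conflicts(entities):
    """Pairs (d1, d2) of distinct displayable entities with d1 a substring of d2."""
    return {(d1, d2) for d1 in entities for d2 in entities if d1 != d2 and d1 in d2}

print(sorted(displayable_entities("abczdefydefxabc")))
print(sorted(subword_conflicts(displayable_entities("cdefabcgabcde"))))
```

On the two example strings used in this paper, the sketch reports abc and def for abczdefydefxabc, and a subword conflict of c with both abc and cde for cdefabcgabcde.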
2.2 Symmetric Compact Directed Acyclic Word Graphs (SCDAWGs)
An scdawg, SCD(S), corresponding to a string S is a directed acyclic graph defined by a set of vertices, V(S), a set, R(S), of labeled directed edges called right extension (re) edges, and a set, L(S), of labeled directed edges called left extension (le) edges. Each vertex of V(S) represents a substring of S. Specifically, V(S) consists of a source (which represents the empty word), a sink (which represents S), and a vertex corresponding to each displayable entity of S. Let de(v) denote the string represented by vertex v (v ∈ V(S)). Define the implication, imp(S, α), of a string α in S to be the smallest superword of α in {de(v) : v ∈ V(S)}, if such a superword exists. Otherwise, imp(S, α) does not exist.

Re edges from a vertex, v_1, are obtained as follows: for each letter, x, in Σ, if imp(S, de(v_1)x) exists and is equal to de(v_2) = β de(v_1) x γ, then there exists an re edge from v_1 to v_2 with label xγ. If β is the empty string, then the edge is known as a prefix extension edge.

Le edges from a vertex, v_1, are obtained as follows: for each letter, x, in Σ, if imp(S, x de(v_1)) exists and is equal to de(v_2) = γ x de(v_1) β, then there exists an le edge from v_1 to v_2 with label γx. If β is the empty string, then the edge is known as a suffix extension edge.

[Figure 3: Scdawg for S = cdefabcgabcde (L(S) not shown). Vertices: source, c, abc, cde, sink. Re edge labels shown: abc, bc, c, de, e, fabcgabcde, gabcde.]

Figure 3 shows V(S) and R(S) corresponding to S = cdefabcgabcde. abc, cde, and c are the displayable entities of S. There are two outgoing re edges from the vertex representing abc. These edges correspond to x = d and x = g. imp(S, abcd) = imp(S, abcg) = S. Consequently, both edges are incident on the sink. There are no edges corresponding to the other letters of the alphabet as imp(S, abcx) does not exist for x ∈ {a, b, c, e, f}.

The space required for SCD(S) is O(n) and the time needed to construct it is O(n) [5, 4]. While we have defined the scdawg data structure for a single string, S, it can be extended to represent a set of strings.
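The definition of imp and of re edges can be traced on the example of Figure 3. The sketch below, assuming Python, derives the re edges straight from the definition; it takes the vertex strings as given input and identifies each vertex with the string it represents, both of which are illustrative assumptions. The names `imp` and `re_edges` are hypothetical, and this quadratic-time derivation is not the O(n) construction of [5, 4].

```python
def imp(vertices, alpha):
    """Smallest superword of alpha among the vertex strings, or None if none exists."""
    supers = [d for d in vertices if alpha in d]
    return min(supers, key=len) if supers else None

def re_edges(s, vertices):
    """Map each vertex string to its re edges: (label, target, is_prefix_extension)."""
    edges = {}
    for v1 in vertices:
        for x in sorted(set(s)):
            v2 = imp(vertices, v1 + x)
            if v2 is None:
                continue                          # no re edge for this letter
            beta_len = v2.find(v1 + x)            # de(v2) = beta + de(v1) + x + gamma
            label = v2[beta_len + len(v1):]       # edge label is x + gamma
            edges.setdefault(v1, []).append((label, v2, beta_len == 0))
    return edges

S = "cdefabcgabcde"
VERTICES = ["", "c", "abc", "cde", S]   # source, displayable entities, sink
EDGES = re_edges(S, VERTICES)
for v, out in EDGES.items():
    print(repr(v), out)
```

For the vertex abc this yields exactly two re edges, labeled de and gabcde, both ending at the sink, which agrees with the discussion of Figure 3.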
2.3 Computing Occurrences of Displayable Entities
Figure 4 presents an algorithm for computing the end positions of all the occurrences of de(v) in S. This is based on the outline provided in [4]. The complexity of Occurrences(S, v, 0) is proportional to the number of occurrences of de(v) in S.

Algorithm A
    Occurrences(S, v, 0)

Procedure Occurrences(S: string, u: vertex, i: integer)
begin
    if de(u) is a suffix of S then output(|S| - i);
    for each right out edge, e, from u do
    begin
        Let w be the vertex on which e is incident;
        Occurrences(S, w, |label(e)| + i);
    end;
end;

Figure 4: Algorithm for obtaining occurrences of displayable entities
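As an illustration of Figure 4, the sketch below runs the same recursion in Python on the re edges of the scdawg of Figure 3, written out by hand. Keying each vertex by the string it represents, the dictionary `RE_EDGES`, and returning a list instead of calling output are all assumptions made for this example.

```python
S = "cdefabcgabcde"

# Hypothetical encoding: each vertex is keyed by the string it represents
# and maps to its outgoing re edges as (label, target) pairs.
RE_EDGES = {
    "": [("abc", "abc"), ("bc", "abc"), ("c", "c"), ("de", "cde"),
         ("e", "cde"), ("fabcgabcde", S), ("gabcde", S)],
    "c": [("de", "cde"), ("gabcde", S)],
    "abc": [("de", S), ("gabcde", S)],
    "cde": [("fabcgabcde", S)],
    S: [],
}

def occurrences(s, u, i=0):
    """End positions (1-based) of the occurrences found from vertex u, as in Figure 4."""
    ends = []
    if s.endswith(u):                     # de(u) is a suffix of S
        ends.append(len(s) - i)
    for label, w in RE_EDGES[u]:          # follow every right out edge
        ends.extend(occurrences(s, w, len(label) + i))
    return ends

print(sorted(occurrences(S, "abc")))
print(sorted(occurrences(S, "c")))
```

Each reported position is obtained by subtracting the accumulated label length from |S|, so the work done is proportional to the number of occurrences reported.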
2.4 Prefix and Suffix Extension Trees
The prefix extension tree, PET(S, v), at vertex v in V(S) is a subgraph of SCD(S) consisting of (i) the root, v, (ii) PET(S, w) defined recursively for each vertex w in V(S) such that there exists a prefix extension edge from v to w, and (iii) the prefix extension edges leaving v. The suffix extension tree, SET(S, v), at v is defined analogously.

In Figure 3, PET(S, v), de(v) = c, consists of the vertices representing c and cde, and the sink. It also includes the prefix extension edges from c to cde and from cde to the sink. Similarly, SET(S, v), de(v) = c, consists of the vertices representing c and abc and the suffix extension edge from c to abc (not shown in the figure).
Lemma 1 PET(S, v) (SET(S, v)) contains a directed path from v to a vertex, w, in V(S) iff de(v) is a prefix (suffix) of de(w).

Proof If there is a directed path in PET(S, v) from v to some vertex, w, then from the definition of a prefix extension edge and the transitivity of the "prefix of" relation, de(v) must be a prefix of de(w). If de(v) is a prefix of de(w), then there exists a series of re edges from v to w, such that de(v), when concatenated with the labels on these edges, yields de(w). But each of these re edges must be a prefix extension edge. So a directed path from v to w exists in PET(S, v). The proof for SET(S, v) is analogous. □
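Lemma 1 can be observed on the scdawg of Figure 3. The following Python sketch collects the vertices of PET(S, v) by following only prefix extension edges; the hand-written dictionary `RE_EDGES`, the use of strings as vertices, and the function name `pet` are assumptions for illustration. An re edge from v to w is treated as a prefix extension edge when de(w) equals de(v) followed by the edge label, which is the case where the string preceding de(v) in de(w) is empty.

```python
S = "cdefabcgabcde"

# Hypothetical hand-built re edges of Figure 3: vertex string -> [(label, target)].
RE_EDGES = {
    "c": [("de", "cde"), ("gabcde", S)],
    "abc": [("de", S), ("gabcde", S)],
    "cde": [("fabcgabcde", S)],
    S: [],
}

def pet(v):
    """Vertex set of PET(S, v), reached through prefix extension edges only."""
    tree = {v}
    for label, w in RE_EDGES[v]:
        if w == v + label:                # nothing precedes de(v) in de(w)
            tree |= pet(w)
    return tree

print(sorted(pet("c"), key=len))
print(sorted(pet("abc"), key=len))
```

For de(v) = c the tree contains c, cde, and the sink, as stated above, and every string in it has c as a prefix. For de(v) = abc the tree is the root alone, since abc is not a prefix of any other vertex string.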
3 Computing Conflicts
3.1 Algorithm to determine whether a string is conflict free
Before describing our algorithm to determine if a string is free of conflicts, we establish some properties of conflict free strings that will be used in this algorithm.
Lemma 2 If a prefix-suffix conflict occurs in a string S, then a subword conflict must occur in S.

Proof If a prefix-suffix conflict occurs between two displayable entities, W_1 and W_2, then there exists W_p W_m W_s such that W_p W_m = W_1 and W_m W_s = W_2. Since W_1 and W_2 are maximal, W_1 isn't always followed by the same letter and W_2 isn't always preceded by the same letter. I.e., W_m isn't always followed by the same letter and W_m isn't always preceded by the same letter. So, W_m is maximal. But W_1 occurs at least twice in S (since W_1 is a displayable entity). So W_m occurs at least twice (since W_m is a subword of W_1) and is a displayable entity. But W_m is a subword of W_1. So a subword conflict occurs between W_m and W_1. □
Corollary 1 If string S is free of subword conflicts, then it is free of conflicts.

Lemma 3 de(w) is a subword of de(v) in S iff there is a path comprising right extension and suffix extension edges from w to v.

Proof From the definition of SCD(S), if there exists an re edge from u to v, then de(u) is a subword of de(v). If there exists a suffix extension edge from u to v, then de(u) is a suffix (and therefore a subword) of de(v). If there exists a path comprising right and suffix extension edges from w to v, then by transitivity, de(w) is a subword of de(v).

If de(w) is a suffix of de(v), then there is a path (Lemma 1) of suffix extension edges from w to v. If de(w) is a subword, but not a suffix of de(v), then from the definition of an scdawg, there is a path of re edges from w to a vertex representing a suffix of de(v). □

Let V_source denote all vertices in V(S) such that an re or suffix extension edge exists between the source vertex of SCD(S) and each element of V_source.

Algorithm NoConflicts(S)
1. Construct SCD(S).
2. Compute V_source.
3. Scan all right and suffix extension out edges from each element of V_source. If any edge points to a vertex other than the sink, then a conflict exists. Otherwise, S is conflict free.

Figure 5: Algorithm to determine whether a string is conflict free
Lemma 4 String S is conflict free iff all right extension or suffix extension edges leaving vertices in V_source end at the sink vertex of SCD(S).

Proof A string, S, is conflict free iff there does not exist a right or suffix extension edge between two vertices, neither of which is the source or sink of SCD(S) (Corollary 1 and Lemma 3).

Assume that S is conflict free. Consider a vertex, v, in V_source. If v has a right or suffix extension out edge ⟨v, w⟩, then v ≠ sink. If w ≠ sink, then de(v) is a subword of de(w) and the string is not conflict free. This contradicts the assumption on S.

Next, assume that all right and suffix extension edges leaving vertices in V_source end at the sink vertex. Clearly, there cannot exist right or suffix extension edges between any two vertices, v and w (v ≠ sink, w ≠ sink), in V_source. Further, there cannot exist a vertex, x, in V(S) (x ≠ source, x ≠ sink) such that x ∉ V_source. For such a vertex to exist, there must exist a path consisting of right and suffix extension edges from a vertex in V_source to x. Clearly, this is not true. So, S is conflict free. □

The preceding development leads to algorithm NoConflicts (Figure 5).
Theorem 1 Algorithm NoConflicts is both correct and optimal.

Proof Correctness is an immediate consequence of Lemma 4. Step 1 takes O(n) time [4]. Step 2 takes O(1) time since |V_source| < 2|Σ|. Step 3 takes O(1) time since the number of out edges leaving V_source is less than $4|\Sigma|^2$. So, NoConflicts takes O(n) time, which is optimal. Actually, steps 2 and 3 can be merged into step 1 and the construction of SCD(S) aborted as soon as an edge that violates Lemma 4 is created. □
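Steps 2 and 3 of NoConflicts can be sketched in Python as follows. The sketch assumes the vertex strings of SCD(S) are supplied by the caller and derives edges from the definition of imp by brute force, so it does not reflect the O(n) construction of step 1; the names `imp`, `out_edge_targets`, and `no_conflicts` are hypothetical.

```python
def imp(vertices, alpha):
    """Smallest superword of alpha among the vertex strings, or None."""
    supers = [d for d in vertices if alpha in d]
    return min(supers, key=len) if supers else None

def out_edge_targets(s, vertices, v1):
    """Targets of the re edges and suffix extension edges leaving v1."""
    targets = set()
    for x in set(s):
        v2 = imp(vertices, v1 + x)                    # right extension edge
        if v2 is not None:
            targets.add(v2)
        v2 = imp(vertices, x + v1)                    # left extension edge,
        if v2 is not None and v2.endswith(x + v1):    # kept only if nothing follows de(v1)
            targets.add(v2)
    return targets

def no_conflicts(s, vertices):
    """Lemma 4: every re or suffix extension edge leaving V_source ends at the sink."""
    v_source = out_edge_targets(s, vertices, "")
    return all(t == s for v in v_source for t in out_edge_targets(s, vertices, v))

S1 = "abczdefydefxabc"
S2 = "cdefabcgabcde"
print(no_conflicts(S1, ["", "abc", "def", S1]))
print(no_conflicts(S2, ["", "c", "abc", "cde", S2]))
```

The first string has no conflicts, since abc and def are unrelated. The second fails the test because the edge from c to cde does not end at the sink.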
3.2 Subword Conflicts
Consider the problem of finding all subword conflicts in string S. Let k_s be the number of subword conflicts in S. Any algorithm to solve this problem requires (i) Ω(n) time to read in the input string and (ii) Ω(k_s) time to output all subword conflicts. So, Ω(n + k_s) is a lower bound on the time complexity for this problem. For the string S = a^n, $k_s = n^4/24 + n^3/4 - 13n^2/24 - 3n/4 + 1 = O(n^4)$. This is an upper bound on the number of conflicts as the maximum number of substring occurrences is O(n^2) and in the worst case, all occurrences conflict with each other.

In this section, a compact method for representing conflicts is presented. Let k_sc be the size of this representation. For a^n, $k_{sc} = n^3/6 + n^2/2 - 5n/3 = O(n^3)$. Compaction never increases the size of the output and may yield up to a factor of n reduction, as in the example. The compaction method is described below.

Consider S = abcdbcgabcdbchbc. The displayable entities are D_1 = abcdbc and D_2 = bc. The ending positions of D_1 are 6 and 13 while those of D_2 are 3, 6, 10, 13, and 16. A list of the subword conflicts between D_1 and D_2 can be written as: {(6,3), (6,6), (13,10), (13,13)}. The first element of each ordered pair is the last position of the instance of the superstring (here, D_1) involved in the conflict; the second element of each ordered pair is the last position of the instance of the substring (here, D_2) involved in the conflict.

The cardinality of the set is the number of subword conflicts between D_1 and D_2. This is given by: frequency(D_1) × (number of occurrences of D_2 in D_1). Since each conflict is represented by an ordered pair, the size of the output is 2 × frequency(D_1) × (number of occurrences of D_2 in D_1).
Observe that the occurrences of D_2 in D_1 are in the same relative positions in all instances of D_1. It is therefore possible to write the list of subword conflicts between D_1 and D_2 as: (6,13):(0,-3). The first list gives all the occurrences in S of the superstring (D_1), and the second gives the relative positions of all the occurrences of the substring (D_2) in the superstring (D_1) from the right end of D_1. The size of the output is now: frequency(D_1) + (number of occurrences of D_2 in D_1). This is more economical than our earlier representation.

In general, a substring, D_i, of S will have conflicts with many instances of a number of displayable entities (say, D_j, D_k, ..., D_z) of which it (D_i) is the superword. We would then write the conflicts of D_i as:

$(l_{i1}, l_{i2}, \ldots, l_{im_i}) : (l_{j1}, l_{j2}, \ldots, l_{jm_j}), (l_{k1}, l_{k2}, \ldots, l_{km_k}), \ldots, (l_{z1}, l_{z2}, \ldots, l_{zm_z})$

Here, the l_i's represent all the occurrences of D_i in S; the l_j's, l_k's, ..., l_z's represent the relative positions of all the occurrences of D_j, D_k, ..., D_z in D_i. One such list will be required for each displayable entity that contains other displayable entities as subwords. The following equalities are easily obtained:

Size of Compact Representation = $\sum_{D_i \in D} \left( f_i + \sum_{D_j \in D_i^s} r_{ij} \right)$

Size of Original Representation = $2 \sum_{D_i \in D} \left( f_i \times \sum_{D_j \in D_i^s} r_{ij} \right)$

f_i is the frequency of D_i (only D_i's that have conflicts are considered). r_ij is the frequency of D_j in one instance of D_i. D represents the set of all displayable entities of S. $D_i^s$ represents the set of all displayable entities that are subwords of D_i.
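The two representations can be compared on the example S = abcdbcgabcdbchbc. The sketch below, assuming Python, computes both by direct string search rather than with the scdawg; the function names `end_positions`, `explicit_conflicts`, and `compact_conflicts` are hypothetical.

```python
def end_positions(s, p):
    """1-based end positions of all occurrences of p in s."""
    return [i + len(p) for i in range(len(s) - len(p) + 1) if s.startswith(p, i)]

def explicit_conflicts(s, sup, sub):
    """Ordered pairs (end of superstring instance, end of substring instance)."""
    sub_ends = end_positions(s, sub)
    pairs = []
    for e in end_positions(s, sup):
        first = e - len(sup) + len(sub)       # smallest end position that still fits
        pairs.extend((e, f) for f in sub_ends if first <= f <= e)
    return pairs

def compact_conflicts(s, sup, sub):
    """Occurrences of sup in s, and positions of sub relative to the right end of sup."""
    occ = end_positions(s, sup)
    rel = sorted((f - len(sup) for f in end_positions(sup, sub)), reverse=True)
    return occ, rel

S = "abcdbcgabcdbchbc"
pairs = explicit_conflicts(S, "abcdbc", "bc")
occ, rel = compact_conflicts(S, "abcdbc", "bc")
print(pairs, 2 * len(pairs))
print(occ, rel, len(occ) + len(rel))
```

For this string the explicit list has size 8 (four ordered pairs) while the compact form (6,13):(0,-3) has size 4, matching the two size expressions above.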
SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of vertices, SV(S, v) ⊆ V(S), which represent displayable entities that are subwords of de(v), and the set SE(S, v) of all re and suffix extension edges that connect any pair of vertices in SV(S, v). Define SGR(S, v) as SG(S, v) with the directions of all the edges in SE(S, v) reversed.

Lemma 5 SG(S, v) consists of all vertices, w, such that a path comprising right or suffix extension edges joins w to v in SCD(S).

Proof Follows from Lemma 3. □
Algorithm B
 1  begin
 2    for each vertex, v, in SCD(S) do
 3    begin
 4      v.subword = false;
 5      for all vertices, u, such that a right or suffix extension edge, ⟨u, v⟩, is incident on v do
 6        if u ≠ source then v.subword = true;
 7    end
 8    for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
 9      GetSubwords(v);
10  end

Procedure GetSubwords(v)
 1  begin
 2    Occurrences(S, v, 0);
 3    output(v.list);
 4    v.sublist = {0};
 5    SetUp(v);
 6    SetSuffixes(v);
 7    for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
 8    begin
 9      if de(x) is a suffix of de(v) then x.sublist = {0} else x.sublist = {};
10      for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
11      begin
12        for each element, l, in w.sublist do
13          x.sublist = x.sublist ∪ {l - |label(e)|};
14      end;
15      output(x.sublist);
16    end;
17  end

Figure 6: Optimal algorithm to compute all subword conflicts
Algorithm B of Figure 6 computes the subword conflicts of S. The subword conflicts are computed for precisely those displayable entities which have subword displayable entities. Lines 4 to 6 of Algorithm B determine whether de(v) has subword displayable entities. Each incoming right or suffix extension edge to v is checked to see whether it originates at the source. If any incoming edge originates at a vertex other than source, then v.subword is set to true (Lemma 3). If all incoming edges originate from source, then v.subword is set to false. Procedure GetSubwords(v), which computes the subword conflicts of de(v), is invoked if v.subword is true.

Procedure Occurrences(S, v, 0) (line 2 of GetSubwords) computes the occurrences of de(v) in S and places them in v.list. Procedure SetUp in line 5 traverses SGR(S, v) and initializes fields in each vertex of SGR(S, v) so that a reverse topological traversal of SG(S, v) may be subsequently performed. Procedure SetSuffixes in line 6 marks vertices whose displayable entities are suffixes of de(v). This is accomplished by following the chain of reverse suffix extension pointers starting at v and marking the vertices encountered as suffixes of v.

A list of relative occurrences, sublist, is associated with each vertex, x, in SG(S, v). x.sublist represents the relative positions of de(x) in an occurrence of de(v). Each relative occurrence is denoted by its position relative to the last position of de(v), which is represented by 0. If de(x) is a suffix of de(v), then x.sublist is initialized with the element 0. The remaining elements of x.sublist are computed from the sublist fields of vertices, w, in SG(S, v) such that a right extension edge goes from x to w. Consequently, w.sublist must be computed before x.sublist. This is achieved by traversing SG(S, v) in reverse topological order [9].
Lemma 6 x.sublist for vertex, x, in SG(S, v) contains all relative occurrences of de(x) in de(v) on completion of GetSubwords(v).

Proof The correctness of this lemma follows from the correctness of procedure Occurrences(S, v, 0) of Section 2.3 and the observation that lines 7 to 15 of procedure GetSubwords achieve the same effect as Occurrences(S, v, 0) in SG(S, v). □
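The sublist computation of lines 7 to 15 of GetSubwords can be traced on S = abcdbcgabcdbchbc with de(v) = abcdbc. The Python sketch below assumes that SG(S, v) has already been set up: the reverse topological order and the re edges inside SG(S, v) are written out by hand, vertices are identified with their strings, and the suffix test uses a direct string comparison in place of SetSuffixes. The name `relative_occurrences` is hypothetical.

```python
def relative_occurrences(v, order, re_edges):
    """Compute x.sublist for every x in SG(S, v), visiting vertices in the given order.

    order    -- vertex strings of SG(S, v) in reverse topological order (v first)
    re_edges -- vertex string -> list of (label, target) re edges inside SG(S, v)
    """
    sublist = {}
    for x in order:
        # Line 9: a suffix of de(v) occurs at relative position 0.
        sub = {0} if v.endswith(x) else set()
        # Lines 10-14: shift every relative occurrence of the target by the label length.
        for label, w in re_edges.get(x, []):
            sub |= {l - len(label) for l in sublist[w]}
        sublist[x] = sub
    return sublist

V = "abcdbc"
ORDER = ["abcdbc", "bc"]                    # v is visited before its subwords
RE_EDGES = {"bc": [("dbc", "abcdbc")]}      # bc followed by d leads to abcdbc
rel = relative_occurrences(V, ORDER, RE_EDGES)
v_list = [6, 13]                            # end positions of abcdbc in S
print(rel["bc"])
print(sorted(e + l for e in v_list for l in rel["bc"]))
```

Adding each relative position to each end position of abcdbc recovers the end positions 3, 6, 10, and 13 of the conflicting instances of bc.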
Theorem 2 Algorithm B takes O(n + k_sc) time and space and is therefore optimal.

Proof Computing v.subword for each vertex, v, in V(S) takes O(n) time as constant time is spent at each vertex and edge in SCD(S). Consider the complexity of GetSubwords(v). Lines 2 and 3 take O(|v.list|) time. Let the number of vertices in SG(S, v) be m. Then the number of edges in SG(S, v) is O(m). Line 5 traverses SG(S, v) and therefore consumes O(m) time. Line 6, in the worst case, could involve traversing SG(S, v), which takes O(m) time. Computing the relative occurrences of de(x) in de(v) (lines 9-15) takes O(|x.sublist|) time for each vertex, x, in SG(S, v). So, the total complexity of GetSubwords(v) is

$O\left( |v.list| + m + \sum_{x \in SV(S,v),\, x \neq v} |x.sublist| \right)$.

However, m is $O\left( \sum_{x \in SV(S,v),\, x \neq v} |x.sublist| \right)$, since $|x.sublist| \geq 1$ for each x ∈ SG(S, v). But $|v.list| + \sum_{x \in SV(S,v),\, x \neq v} |x.sublist|$ is the size of the output for GetSubwords(v). So, the overall complexity of Algorithm B is

$O\left( n + \sum_{v \in V(S) - \{sink\},\, v.subword = true} |\text{output for GetSubwords}(v)| \right) = O(n + k_{sc})$. □
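The closed forms quoted in this section for S = a^n can be cross-checked by counting. The sketch below, assuming Python, relies on the observation that the displayable entities of a^n are a^1, ..., a^(n-1), that a^k occurs n - k + 1 times in a^n, and that a^j occurs k - j + 1 times in a^k; these counts are derived here for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

def ks_by_counting(n):
    """Sum of frequency(a^k) * (occurrences of a^j in a^k) over all pairs j < k."""
    return sum((n - k + 1) * (k - j + 1)
               for k in range(2, n) for j in range(1, k))

def ksc_by_counting(n):
    """Sum over superwords a^k of f_k plus the relative occurrences of every subword."""
    return sum((n - k + 1) + sum(k - j + 1 for j in range(1, k))
               for k in range(2, n))

def ks_closed_form(n):
    n = Fraction(n)
    return n**4 / 24 + n**3 / 4 - 13 * n**2 / 24 - 3 * n / 4 + 1

def ksc_closed_form(n):
    n = Fraction(n)
    return n**3 / 6 + n**2 / 2 - 5 * n / 3

for n in (3, 4, 5, 10):
    print(n, ks_by_counting(n), ks_closed_form(n), ksc_by_counting(n), ksc_closed_form(n))
```

Exact rational arithmetic is used so that the comparison does not depend on floating point rounding.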
3.3 Prefix-Suffix Conflicts
As with subword conflicts, the lower bound for the problem of computing prefix-suffix conflicts is Ω(n + k_p), where k_p is the number of prefix-suffix conflicts in S. For S = a^n, $k_p = n^4/24 - n^3/12 - 25n^2/24 - 21n/12 + 1 = O(n^4)$, which is also the upper bound on k_p. Unlike subword conflicts, it is not possible to compact the output representation.

Let w and x, respectively, be vertices in SET(S, v) and PET(S, v). Let de(v) = W_v, de(w) = W_w W_v, and de(x) = W_v W_x. Define Pshadow(w, v, x) to be the vertex representing imp(S, W_w W_v W_x), if such a vertex exists. Otherwise, Pshadow(w, v, x) = nil. We define Pimage(w, v, x) = Pshadow(w, v, x) iff Pshadow(w, v, x) = imp(S, W_w W_v W_x) = W_a W_w W_v W_x for some (possibly empty) string, W_a. Otherwise, Pimage(w, v, x) = nil. For each vertex, w, in SET(S, v), a shadow prefix dag, SPD(w, v), rooted at vertex w is comprised of the set of vertices {Pshadow(w, v, x) | x on PET(S, v), Pshadow(w, v, x) ≠ nil}.

Figure 7 illustrates these concepts. Broken lines represent suffix extension edges, dotted lines represent right extension edges, and solid lines represent prefix extension edges.
[Figure 7: Illustration of prefix and suffix trees and a shadow prefix dag. The figure shows SET(S, v), PET(S, v), and SPD(w, v), with vertices labeled v, w, m, a, b, c, l, x, n, q, k, z, p, r, and s.]
SET(S, v), PET(S, v), and SPD(w, v) have been enclosed by dashed, solid, and dotted lines, respectively. We have:
Pshadow(w, v, v) = Pimage(w, v, v) = w.
Pshadow(w, v, z) = Pshadow(w, v, r) = c. However, Pimage(w, v, z) = Pimage(w, v, r) = nil.
Pshadow(w, v, x) = Pimage(w, v, x) = a.
Pshadow(w, v, p) = b, but Pimage(w, v, p) = nil.
Pshadow(w, v, q) = Pshadow(w, v, s) = Pimage(w, v, q) = Pimage(w, v, s) = nil.
Lemma 7 A prefix-suffix conflict occurs between two displayable entities, W1 = de(w) and W2 = de(x), with respect to a third displayable entity Wm = de(v) iff (i) w occurs in SET(S, v) and x occurs in PET(S, v), and (ii) Pshadow(w, v, x) ≠ nil. The number of conflicts between de(w) and de(x) with respect to de(v) is equal to the number of occurrences of de(Pshadow(w, v, x)) in S.
Proof By definition, a prefix-suffix conflict occurs between displayable entities W1 and W2 with respect to Wm iff there exists Wp Wm Ws in S, where W1 = Wp Wm and W2 = Wm Ws. Clearly, Wm is a suffix of W1 and Wm a prefix of W2 iff w occurs in SET(S, v) and x occurs in PET(S, v). Wp Wm Ws occurs in S iff imp(S, Wp Wm Ws) = Pshadow(w, v, x) ≠ nil. The number of conflicts between de(w) and de(x) is equal to the number of occurrences of imp(S, Wp Wm Ws) = Pshadow(w, v, x) in S. □
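The counting claim of Lemma 7 can be checked by brute force on a small string. The sketch below is only an illustration under stated assumptions, not the SCD(S)-based method of this section: it works on plain Python strings, the helper names count_occurrences and count_prefix_suffix_conflicts are hypothetical, and it counts the occurrences of Wp Wm Ws directly. The example uses the string S = abcicdefcdegabchabcde from the Introduction with W1 = abc, W2 = cde, and Wm = c.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def count_prefix_suffix_conflicts(s, w1, w2, wm):
    """Number of conflicts between w1 = Wp Wm and w2 = Wm Ws with respect to wm.

    Brute-force stand-in (an assumption of this sketch): count the
    occurrences of Wp Wm Ws in s instead of consulting SCD(S).
    """
    # wm must be a non-empty proper suffix of w1 and proper prefix of w2.
    if not wm or len(wm) >= min(len(w1), len(w2)):
        return 0
    if not (w1.endswith(wm) and w2.startswith(wm)):
        return 0
    merged = w1 + w2[len(wm):]  # the string Wp Wm Ws
    return count_occurrences(s, merged)


S = "abcicdefcdegabchabcde"
conflicts = count_prefix_suffix_conflicts(S, "abc", "cde", "c")
print(conflicts)
```

Here abc and cde overlap only in the suffix abcde of S, so a single conflict is counted. The brute-force scan takes time proportional to the string length per query, which is why the paper relies on SCD(S) instead.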
Lemma 8 If a prefix-suffix conflict does not occur between de(w) and de(x) with respect to de(v), where w occurs in SET(S, v) and x occurs in PET(S, v), then there are no prefix-suffix conflicts between any displayable entity which represents a descendant of w in SET(S, v) and any displayable entity which represents a descendant of x in PET(S, v) with respect to de(v).
Proof Since w is in SET(S, v) and x is in PET(S, v), we can represent de(w) by Wp de(v) and de(x) by de(v) Ws. If no conflicts occur, then Wp de(v) Ws does not occur in S. The descendants of w in SET(S, v) represent displayable entities of the form Wa de(w) = Wa Wp de(v), while the descendants of x in PET(S, v) represent displayable entities of the form de(x) Wb = de(v) Ws Wb, where Wa and Wb are substrings of S. For a prefix-suffix conflict to occur between Wa de(w) and de(x) Wb with respect to de(v), Wa Wp de(v) Ws Wb must exist in S. However, this is not possible as Wp de(v) Ws does not occur in S, and the result follows. □

Figure 8: Illustration of conditions for Lemma 9
[Drawing not reproduced. Labeled vertices: w, y, u, r, v, t, x, z. Labeled edges: e, f. The legend distinguishes prefix extension, suffix extension, and right extension edges.]
Lemma 9 In SCD(S), if (i) y = Pimage(w, v, x), (ii) there is a prefix extension edge, e, from x to z with label aα, and (iii) there is a right extension edge, f, from y to u with label aβ, then Pshadow(w, v, z) = u.

Proof Let de(w) = Ww de(v) and de(x) = de(v) Wx. By definition, de(y) = Wa Ww de(v) Wx for some possibly empty string Wa. de(z) = de(x) aα = de(v) Wx aα. de(u) = Wb de(y) aβ = Wb Wa Ww de(v) Wx aβ for some string Wb.

Pshadow(w, v, z) = imp(Ww de(v) Wx aα). To prove the lemma, we must show that Pshadow(w, v, z) = u, i.e., that (i) Ww de(v) Wx aα is a subword of de(u) and (ii) de(u) is the smallest superword of Ww de(v) Wx aα represented by a vertex in SCD(S).

(i) Assume that Ww de(v) Wx aα is not a subword of de(u) = Wb Wa Ww de(v) Wx aβ, i.e., α is not a prefix of β.

Case 1: β is a proper prefix of α. Since Wb Wa Ww de(v) Wx aβ is maximal, its occurrences are not all followed by the same letter. This is true for any of its suffixes. In particular, all occurrences of de(v) Wx aβ cannot be followed by the same letter. Similarly, all occurrences of de(v) Wx aβ cannot be preceded by the same letter, as it is a prefix of de(v) Wx aα = de(z). So, de(v) Wx aβ is a displayable entity of S. Consequently, the prefix extension edge from x corresponding to the letter a must be directed to the vertex representing de(v) Wx aβ. This is a contradiction.

Case 2: aα matches aβ in the first k characters, but not in the (k + 1)th character (1 ≤ k < 1 + min(|α|, |β|)). We have aα = aγα1 and aβ = aγβ1, where |γ| = k - 1. Clearly, the strings de(v) Wx aγα1 and Wb Wa Ww de(v) Wx aγβ1 occur in S. I.e., all occurrences of de(v) Wx aγ cannot be followed by the same letter. Further, all occurrences of de(v) Wx aγ cannot be preceded by the same letter, as it is a prefix of de(v) Wx aα = de(z). So, it is a displayable entity of S. Consequently, the prefix extension edge from x corresponding to the letter a must be directed to the vertex representing de(v) Wx aγ. This results in a contradiction. Thus, α is a prefix of β.

(ii) From (i), α is a prefix of β. Assume that Wb Wa Ww de(v) Wx aβ is not the smallest superword of Ww de(v) Wx aα. Since de(y) = Wa Ww de(v) Wx, with y = Pimage(w, v, x), is the smallest superword of Ww de(v) Wx, the smallest superword of Ww de(v) Wx aα must be of the form Wb1 Wa Ww de(v) Wx aγ, where α is a prefix of γ, and γ is a proper prefix of β and/or Wb1 is a proper suffix of Wb. But the right out edge, f, from y points to the smallest superword of Wa Ww de(v) Wx a (from the definition of SCD(S)), which is Wb Wa Ww de(v) Wx aβ. So, Wb1 = Wb and γ = β, which is a contradiction. □
Lemma 10 In SCD(S), if (i) y = Pimage(w, v, x), (ii) there is a path of prefix extension edges from x to x1 (let the concatenation of their labels be aα), (iii) there is a prefix extension edge from x1 to z with label bβ, and (iv) there is a right extension edge, f, from y to u with label aαbγ, then u = Pshadow(w, v, z) ≠ nil.

Proof Similar to the proof of Lemma 9. □
Figure 9: Illustration of conditions for Lemmas 10 and 11
[Drawing not reproduced. Labeled vertices: w, y, u, r, v, t, x, x1, z. Labeled edge f and prefix extension edge path P. The legend distinguishes prefix extension, suffix extension, and right extension edges.]
Algorithm C
1 Construct SCD(S).
2 for each vertex, v, in SCD(S) do
3   NextSuffix(v, v);

Procedure NextSuffix(current, v);
1 for each suffix extension edge <current, w> do
2   {there can only be one suffix extension edge from current to w}
3   begin
4     exist := false;
5     ShadowSearch(v, w, v, w);
6     if exist then NextSuffix(w, v);
7   end;

Figure 10: Optimal algorithm to compute all prefix-suffix conflicts
Lemma 11 In Lemma 9 or Lemma 10, if |label(f)| ≤ the sum of the lengths of the labels of the edges on the prefix extension edge path P from x to z, then label(f) = the concatenation of the labels on P and u = Pimage(w, v, z).

Proof From Lemma 10, the concatenation of the labels of the edges of P is a prefix of label(f). But |label(f)| ≤ the sum of the lengths of the labels of the edges on P. I.e., label(f) = the concatenation of the labels of the series of edges on P. u = Pimage(w, v, z) follows. □
Lemma 12 If Pshadow(w, v, x) = nil, then Pshadow(w, v, y) = nil for all descendants, y, of x in PET(S, v).

Proof Follows from Lemmas 7 and 8. □

Algorithm C in Figure 10 computes all prefix-suffix conflicts of S. Line 1 constructs SCD(S). Lines 2 and 3 compute all prefix-suffix conflicts in S by separately computing, for each displayable entity de(v), all the prefix-suffix conflicts of which it is the intersection.

Procedure NextSuffix(current, v) computes all prefix-suffix conflicts between displayable entities represented by descendants of current in SET(S, v) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v) (so the call to NextSuffix(v, v) in line 3 of Algorithm C computes all prefix-suffix conflicts with respect to de(v)). It does so by identifying SPD(w, v) for each child, w, of current in SET(S, v). The call to ShadowSearch(v, w, v, w) in line 5 identifies SPD(w, v) and computes all prefix-suffix conflicts between de(w) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v). If ShadowSearch(v, w, v, w) does not report any prefix-suffix conflicts, then the global variable exist is unchanged by ShadowSearch(v, w, v, w) (i.e., exist = false, from line 4). Otherwise, it is set to true by ShadowSearch. Line 6 ensures that NextSuffix(w, v) is called only if ShadowSearch(v, w, v, w) detected prefix-suffix conflicts between de(w) and displayable entities represented by descendants of v in PET(S, v) with respect to de(v) (Lemma 8).

For each descendant, q, of vertex x in PET(S, v), procedure ShadowSearch(v, w, x, y) computes all prefix-suffix conflicts between de(w) and de(q) with respect to de(v). y represents Pshadow(w, v, x). We will show that all calls to ShadowSearch maintain the invariant (which is referred to as the image invariant hereafter) that y = Pimage(w, v, x) ≠ nil. Notice that the invariant holds when ShadowSearch is called from NextSuffix, as w = Pimage(w, v, v). The for statement in line 1 examines each prefix out edge from x. Lines 3 to 28 compute all prefix-suffix conflicts between de(w) and displayable entities represented by vertices in PET(S, z), where z is the vertex on which the prefix extension edge from x is incident. The truth of the condition in the for statement of line 1, line 4, and the truth of the condition inside the if statement of line 5 establish that the conditions of Lemma 9 are satisfied prior to the execution of lines 8 and 9. The truth of the comment in line 8 and the correctness of line 9 are established by Lemma 9. Procedure ListConflicts of line 9 lists all prefix-suffix conflicts between de(w) and de(z) with respect to de(v).
Similarly, the truth of the condition inside the while statement of line 11, lines 13 and 14, and the truth of the condition inside the if statement of line 15 establish that the conditions of Lemma 10 are satisfied prior to the execution of lines 18-20. Again, the correctness of lines 18-20 is established by Lemma 10. If done remains false on exiting the while loop, the condition of the if statement of line 15 must have evaluated to true. Consequently, the conditions of Lemma 10 apply. Further, since the while loop of line 11 terminated, the additional condition of Lemma 11 is also satisfied. Hence, from Lemma 11, u = Pimage(w, v, z) and the image invariant for the recursive call to ShadowSearch(v, w, z, u) is maintained. Line 27 sets the global variable exist to true, since the execution of the then clause of the if statement of line 5 ensures that at least one prefix-suffix conflict is reported by ShadowSearch(v, w, v, w) (Lemmas 7 and 9). exist remains false only if the then clause of the if statement (line 5) is never executed.

Procedure ShadowSearch(v, w, x, y);
1  for each prefix extension edge e = <x, z> do
2    {There can only be one prefix extension edge from x to z}
3    begin
4      fc := first character in label(e);
5      if there is a right extension edge, f = <y, u>, whose label starts with fc
6      then
7        begin
8          {u = Pshadow(w, v, z)}
9          ListConflicts(u, z, w);
10         distance := 0; done := false;
11         while (not done) and (|label(f)| > |label(e)| + distance) do
12           begin
13             distance := distance + |label(e)|;
14             nc := (distance + 1)th character in label(f);
15             if there is a prefix extension edge <z, r> starting with nc
16             then
17               begin
18                 z := r;
19                 {u = Pshadow(w, v, z)};
20                 ListConflicts(u, z, w);
21               end
22             else
23               done := true;
24           end
25         if (not done) then
26           ShadowSearch(v, w, z, u);
27         exist := true;
28       end
29   end

Figure 11: Algorithm for shadow search
Theorem 3 Algorithm C computes all prefix-suffix conflicts of S in O(n + kp) space and time, which is optimal.
Proof Line 1 of Algorithm C takes O(n) time [4]. The cost of lines 2 and 3, without including the execution time of NextSuffix(v, v), is O(n).

Next, we show that NextSuffix(v, v) takes O(kv) time, where kv is the number of prefix-suffix conflicts with respect to v (i.e., kv represents the size of the output of NextSuffix(v, v)). Assume that NextSuffix is invoked p times in the computation. Let ST be the set of invocations of NextSuffix which do not call NextSuffix recursively, and let pT = |ST|. Let SF be the set of invocations of NextSuffix which do call NextSuffix recursively, and let pF = |SF|. Each element of SF can directly call at most |Σ| elements of ST. So, pT/pF ≤ |Σ|. From lines 4-6 in NextSuffix(current, v), each element of SF yields at least one distinct conflict from its call to ShadowSearch. Thus, pF ≤ kv. So, p = pT + pF ≤ (|Σ| + 1)kv = O(kv). The cost of execution of NextSuffix without including the costs of recursive calls to NextSuffix and ShadowSearch is O(|Σ|) (= O(1)), as there are at most |Σ| suffix edges leaving a vertex. So, the total cost of execution of all invocations of NextSuffix spawned by NextSuffix(v, v), without including the cost of recursive calls to ShadowSearch, is O(p|Σ|) = O(kv).

Next, we consider the calls to ShadowSearch that were spawned by NextSuffix(v, v). Let TA be the set of invocations of ShadowSearch which do not call ShadowSearch recursively, and let qA = |TA|. Let TB be the set of invocations of ShadowSearch which do call ShadowSearch recursively, and let qB = |TB|. Let q = qA + qB. We have qA ≤ |Σ|qB + |Σ|p. So, q = qA + qB ≤ (|Σ| + 1)qB + |Σ|p. From the algorithm, each element of TB yields a distinct conflict. So, qB ≤ kv. So, q ≤ (|Σ| + 1)qB + |Σ|p = O(kv). The cost of execution of a single call to ShadowSearch, without including the cost of executing recursive calls to ShadowSearch, is O(1) + O(w) + O(complexity of ListConflicts of line 9) + O(sum for i = 1 to w of (complexity of ListConflicts of line 20 in the ith iteration of the while loop)), where w denotes the number of iterations of the while loop. The complexity of ListConflicts is proportional to the number of conflicts it reports. Since ListConflicts always yields at least one distinct conflict, the complexity of ShadowSearch is O(1 + |output|). Summing over all calls to ShadowSearch spawned by NextSuffix(v, v), we obtain O(q + kv) = O(kv). Thus, the total complexity of Algorithm C is O(n + kp). □
3.4 Alternative Algorithms

In this section, an algorithm for computing all conflicts (i.e., both subword and prefix-suffix conflicts) is presented. This solution is relatively simple and has competitive run times. However, it lacks the flexibility required to efficiently solve many of the problems listed in Sections 4, 5, and 6. The algorithm (Algorithm D) is presented in Figure 12.

Step 1 computes a list of all occurrences of all displayable entities in S. This list is obtained by first computing the lists of occurrences corresponding to each vertex of V(S) (except the source and the sink) and then concatenating these lists. Each occurrence is represented by its start and end positions.

Step 2 sorts the list of occurrences obtained in step 1 in increasing order of their start positions. Occurrences with the same start positions are sorted in decreasing order of their end positions. This is done using radix sort.

Step 3 computes for the ith occurrence, occ_i, all its prefix-suffix conflicts with occurrences whose starting positions are greater than its own, and all its subword conflicts with its subwords. occ_i is checked against occ_{i+1}, occ_{i+2}, ..., occ_{i+c} for a conflict. Here, c is the smallest integer for which there is no conflict between occ_i and occ_{i+c}. The start position of occ_{i+c} is greater than the end position of occ_i. The start position of occ_j (j ≥ i + c) will also be greater than the end position of occ_i, since the list of occurrences was sorted in increasing order of start positions. The start positions of occ_{i+1}, ..., occ_{i+c-1} are greater than or equal to the start position of occ_i but are less than or equal to its end position. Those occurrences among {occ_{i+1}, ..., occ_{i+c-1}} whose start positions are equal to that of occ_i have end positions that are smaller (since occurrences with the same start position are sorted in decreasing order of their end positions). The remaining conflicts of occ_i (i.e., subword conflicts with its superwords, and prefix-suffix conflicts with occurrences whose start positions are less than that of occ_i) have already been computed in earlier iterations of the for statement in Algorithm D.

For example, let the input to step 3 be the following list of ordered pairs: ((1,6), (1,3), (1,1), (2,2), (3,8), (3,5), (4,6), (5,8), (6,10)), where the first element of the ordered pair denotes the start position and the second element denotes the end position of the occurrence. Consider the occurrence (3,5). Its conflicts with (1,6), (1,3), and (3,8) are computed in iterations 1, 2, and 5 of the for loop. Its conflicts with (4,6) and (5,8) are computed in iteration 6 of the for loop.
Theorem 4 Algorithm D takes O(n + k) time, where k = kp + ks.

Proof Step 1 takes O(n + o) time, where o is the number of occurrences of displayable entities of S. Step 2 also takes O(n + o) time, since o elements are to be sorted using radix sort with n buckets. Step 3 takes O(o + k) time: the for loop executes O(o) times, and each iteration of the while loop yields a distinct conflict. So, the total complexity is O(n + o + k). We now show that o = O(n + k). Let o1 be the number of occurrences not involved in a conflict. Then o1 ≤ n. Let o2 be the number of occurrences involved in at least one conflict. A single conflict occurs between two occurrences. So 2k ≥ o2. So, o = o1 + o2 ≤ n + 2k = O(n + k). □

Algorithm D can be modified so that the size of the output is kp + ksc. This may be achieved by checking whether an occurrence is the first representative of its pattern in the for loop of step 3. The subword conflicts are only reported for the first occurrence of the pattern. However, the time complexity of Algorithm D remains O(n + k). In this sense, it is suboptimal.
Algorithm D

Step 1: Obtain a list of all occurrences of all displayable entities in the string. This list is obtained by first computing the lists of occurrences corresponding to each vertex of the scdawg (except the source and the sink) and then concatenating these lists.

Step 2: Sort the list of occurrences using the start positions of the occurrences as the primary key (increasing order) and the end position as the secondary key (decreasing order). This is done using radix sort.

Step 3:
for i := 1 to (number of occurrences) do
begin
  j := i + 1;
  while (lastpos(occ_i) ≥ firstpos(occ_j)) do
  begin
    if (lastpos(occ_i) ≥ lastpos(occ_j))
      then occ_i is a superword of occ_j
      else (occ_i, occ_j) have a prefix-suffix conflict;
    j := j + 1;
  end;
end;

Figure 12: A simple algorithm for computing conflicts
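Steps 2 and 3 of Algorithm D can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the list of occurrences (step 1) is taken as given, Python's built-in sort stands in for the radix sort of step 2, the inner loop is additionally bounded by the end of the list, and the function name find_conflicts is hypothetical. The input is the example list of (start, end) pairs used in Section 3.4.

```python
def find_conflicts(occurrences):
    """Return (subword, prefix_suffix) conflict lists for (start, end) pairs."""
    # Step 2: increasing start position, ties broken by decreasing end position.
    # (The paper uses radix sort; sorted() is a stand-in in this sketch.)
    occ = sorted(occurrences, key=lambda o: (o[0], -o[1]))
    subword, prefix_suffix = [], []
    # Step 3: compare occ[i] with its successors until one starts after it ends.
    for i in range(len(occ)):
        j = i + 1
        while j < len(occ) and occ[i][1] >= occ[j][0]:
            if occ[i][1] >= occ[j][1]:
                subword.append((occ[i], occ[j]))        # occ[i] is a superword of occ[j]
            else:
                prefix_suffix.append((occ[i], occ[j]))  # the two occurrences overlap
            j += 1
    return subword, prefix_suffix


example = [(1, 6), (1, 3), (1, 1), (2, 2), (3, 8), (3, 5), (4, 6), (5, 8), (6, 10)]
subword, prefix_suffix = find_conflicts(example)
# Conflicts that involve the occurrence (3, 5):
involving = [pair for pair in subword + prefix_suffix if (3, 5) in pair]
print(involving)
```

Each pair is reported with the earlier list element first, so the conflicts of (3,5) with (1,6), (1,3), and (3,8) appear with (3,5) in second position, and those with (4,6) and (5,8) appear with (3,5) in first position.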
4 Size Restricted Queries

Experimental data show that random strings contain a large number of displayable entities of small length. In most applications, small displayable entities are less interesting than large ones. Hence, it is useful to list only those displayable entities whose lengths are greater than some integer, k. Similarly, it is useful to report exactly those conflicts in which the conflicting displayable entities have length greater than k. This gives rise to the following problems:

P1: List all occurrences of displayable entities whose lengths are greater than k.
P2: Compute all prefix-suffix conflicts involving displayable entities of length greater than k.
P3: Compute all subword conflicts involving displayable entities of length greater than k.

The overlap of a conflict is defined as the string common to the conflicting displayable entities. The overlap of a subword conflict is the subword displayable entity. The overlap of a prefix-suffix conflict is its intersection. The size of a conflict is the length of the overlap. An alternative formulation of the size restricted problem, which also seeks to achieve the goal outlined above, is based on reporting only those conflicts whose size is greater than k. This formulation of the problem is particularly relevant when the conflicts are of more interest than the displayable entities. It also guarantees that all conflicting displayable entities reported have size greater than k. We have the following problems:

P4: Obtain all prefix-suffix conflicts of size greater than some integer k.
P5: Obtain all subword conflicts of size greater than some integer k.

P1 is solved optimally by invoking Occurrences(S, v, 0) for each vertex, v, in V(S), where |de(v)| > k. A combined solution to P2 and P3 uses the approach of Section 3.4. The only modification to the algorithm of Figure 12 is in step 1, which now becomes: Obtain all occurrences of displayable entities whose lengths are greater than k. The resulting algorithm is optimal with respect to the expanded representation of subword conflicts. However, as with the general problem, it is not possible to obtain separate optimal solutions to P2 and P3 by using the techniques of Section 3.4.

An optimal solution to P4 is obtained by executing line 3 of Algorithm C of Figure 10 for only those vertices, v, in V(S) which have |de(v)| > k. An optimal solution to P5 is obtained by the following modification to Algorithm B of Figure 6:
(i) Right extension or suffix extension edges <u, v>, where |de(u)| ≤ k and |de(v)| > k, are marked "disabled".
(ii) The definition of SG(S, v) is modified so that SG(S, v), v ∈ V(S), is defined as the subgraph of SCD(S) which consists of the set of vertices, SV(S, v) ⊆ V(S), which represent displayable entities of length greater than k that are subwords of de(v), and the set of all re and suffix extension edges that connect any pair of vertices in SV(S, v).
(iii) Algorithm B is modified.
The modified algorithm is shown in Figure 13. We note that P3 and P5 are identical, since the overlap of a subword conflict is the same as the subword displayable entity.
Algorithm B
1 begin
2   for each vertex, v, in SCD(S) do
3     v.subword := false;
4   for each vertex, v, in SCD(S) such that |de(v)| > k do
5     for all vertices, u, such that a non disabled right or suffix extension edge, <u, v>, exists do
6       if (u ≠ source) then u.subword := true;
7   for each vertex, v, in SCD(S) such that v ≠ sink and v.subword is true do
8     GetSubwords(v);
9 end

Figure 13: Modified version of algorithm B
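The combined solution to P2 and P3 described above (restrict step 1 of Algorithm D to entities longer than k, then run the remaining steps unchanged) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: occurrences are given as (start, end) pairs with inclusive 1-based positions, the built-in sort replaces radix sort, and the function name conflicts_longer_than is hypothetical.

```python
def conflicts_longer_than(occurrences, k):
    """Conflicts among occurrences of displayable entities of length > k."""
    # Modified step 1: keep only occurrences whose length exceeds k.
    kept = [(s, e) for (s, e) in occurrences if e - s + 1 > k]
    # Step 2: increasing start position, then decreasing end position.
    kept.sort(key=lambda o: (o[0], -o[1]))
    subword, prefix_suffix = [], []
    # Step 3: unchanged scan over the sorted list.
    for i, (_start_i, end_i) in enumerate(kept):
        j = i + 1
        while j < len(kept) and end_i >= kept[j][0]:
            if end_i >= kept[j][1]:
                subword.append((kept[i], kept[j]))
            else:
                prefix_suffix.append((kept[i], kept[j]))
            j += 1
    return kept, subword, prefix_suffix


example = [(1, 6), (1, 3), (1, 1), (2, 2), (3, 8), (3, 5), (4, 6), (5, 8), (6, 10)]
kept, subword, prefix_suffix = conflicts_longer_than(example, 2)
print(len(kept), len(subword), len(prefix_suffix))
```

With k = 2, the occurrences (1,1) and (2,2) are dropped before sorting, so every subword conflict in which they were the subword disappears, while the prefix-suffix conflicts of the example are unaffected.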
5 Pattern Oriented Queries

These queries are useful in applications where the fact that two patterns have a conflict is more important than the number and location of conflicts. The following problems arise as a result:

P6: List all pairs of displayable entities which have subword conflicts.
P7: List all triplets of displayable entities (D1, D2, Dm) such that there is a prefix-suffix conflict between D1 and D2 with respect to Dm.
P8: Same as P6, but size restricted as in P5.
P9: Same as P7, but size restricted as in P4.

P6 may be solved optimally by reporting, for each vertex v in V(S), where v does not represent the sink of SCD(S), the subword displayable entities of de(v), if any. This is accomplished by reporting de(w) for each vertex w, w ≠ source, in SG(S, v). P7 may also be solved optimally by modifying procedure ListConflicts of Figure 11 so that it reports the conflicting displayable entities and their intersection. P8 and P9 may also be solved by making similar modifications to the algorithms of the previous section.
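For small inputs, the output expected from P6 can be produced by brute force, which is handy for checking an scdawg-based implementation. The sketch below is an assumption-laden illustration, not the optimal method described above: it enumerates all substrings (cubic work or worse), treats the two ends of the string as contexts distinct from every letter, and uses the hypothetical names displayable_entities and subword_conflict_pairs.

```python
from collections import defaultdict


def displayable_entities(s):
    """Substrings occurring at least twice that are neither always preceded
    nor always followed by the same letter (brute force)."""
    counts = defaultdict(int)
    lefts = defaultdict(set)
    rights = defaultdict(set)
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = s[i:j]
            counts[sub] += 1
            # String boundaries are modeled as unique context markers.
            lefts[sub].add(s[i - 1] if i > 0 else ("start", i))
            rights[sub].add(s[j] if j < n else ("end", j))
    return {sub for sub, c in counts.items()
            if c >= 2 and len(lefts[sub]) > 1 and len(rights[sub]) > 1}


def subword_conflict_pairs(s):
    """P6 by brute force: pairs (subword, superword) of displayable entities."""
    entities = sorted(displayable_entities(s))
    return [(d1, d2) for d1 in entities for d2 in entities
            if d1 != d2 and d1 in d2]


entities_1 = displayable_entities("abczdefydefxabc")
pairs_2 = subword_conflict_pairs("abcicdefcdegabchabcde")
print(sorted(entities_1), pairs_2)
```

On the first string from the Introduction the only displayable entities are abc and def, so no pair is reported. On the second string, c is itself a displayable entity and is a subword of both abc and cde.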
6 Statistical Queries

These queries are useful when conclusions are to be drawn from the data based on statistical facts. Let f(D) denote the frequency (number of occurrences) of D in the string and rf(D1, D2) the number of occurrences of displayable entity D1 in displayable entity D2. The following queries may then be defined.

P10: For each pair of displayable entities, D1 and D2, involved in a subword conflict (D1 is the subword of D2), obtain p(D1, D2) = (number of occurrences of D1 which occur as subwords of D2) / f(D1).
P11: For each pair of displayable entities, D1 and D2, involved in a prefix-suffix conflict, obtain q(D1, D2) = (number of occurrences of D1 which have prefix-suffix conflicts with D2) / f(D1).

If p(D1, D2) or q(D1, D2) is greater than a statistically determined threshold, then the following could be said with some confidence: presence of D1 implies presence of D2.

Let psf(D1, D2, Dm) denote the number of prefix-suffix conflicts between D1 and D2 with respect to Dm, and psf(D1, D2) the number of prefix-suffix conflicts between D1 and D2. We can approximate p(D1, D2) by rf(D1, D2) · f(D2) / f(D1). The two quantities are identical unless a single occurrence of D1 is a subword of two or more distinct occurrences of D2. Similarly, we can approximate q(D1, D2) by psf(D1, D2) / f(D1). The two quantities are identical unless a single occurrence of D1 has prefix-suffix conflicts with two or more distinct occurrences of D2.

f(D1) can be computed for all displayable entities in SCD(S) in O(n) time by a single traversal of SCD(S) in reverse topological order. rf(D1, D2) may be computed optimally for all D1, D2 by modifying procedure GetSubwords(v) as shown in Figure 14.
psf(D1, D2, Dm) is computed optimally, for all D1, D2, and Dm, where D1 has a prefix-suffix conflict with D2 with respect to Dm, by modifying ListConflicts(u, z, w) of Figure 11 so that it returns f(de(u)), since this is the number of conflicts between de(w) and de(z) with respect to de(v). psf(D1, D2) is calculated by summing psf(D1, D2, Dm) over all intersections, Dm, of prefix-suffix conflicts between D1 and D2. p(D1, D2) and q(D1, D2) may be computed by simple modifications to the algorithms used to compute rf(D1, D2) and psf(D1, D2). These problems may be solved under the size restrictions of P4 and P5 by modifications similar to those made in Section 4.

Procedure GetSubwords(v)
begin
  rf(de(v), de(v)) := 1;
  SetUp(v);
  SetSuffixes(v);
  for each vertex, x (≠ source), in reverse topological order of SG(S, v) do
  begin
    if de(x) is a suffix of de(v) then rf(de(x), de(v)) := 1
    else rf(de(x), de(v)) := 0;
    for each vertex, w, in SG(S, v) on which an re edge, e, from x is incident do
      rf(de(x), de(v)) := rf(de(x), de(v)) + rf(de(w), de(v));
    output(rf(de(x), de(v)));
  end;
end

Figure 14: Modification to GetSubwords(v) for computing relative frequencies
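The relationship between p(D1, D2) and its approximation rf(D1, D2) · f(D2) / f(D1) can be illustrated numerically. The sketch below is a brute-force illustration of the arithmetic only, under stated assumptions: it scans plain strings instead of traversing SCD(S), it assumes D1 occurs at least once, and the function names occurrences, exact_p, and approx_p are hypothetical.

```python
def occurrences(text, pattern):
    """Start indices of all (possibly overlapping) occurrences of pattern."""
    starts = []
    pos = text.find(pattern)
    while pos != -1:
        starts.append(pos)
        pos = text.find(pattern, pos + 1)
    return starts


def exact_p(text, d1, d2):
    """p(D1, D2): fraction of occurrences of d1 lying inside an occurrence of d2."""
    occ1 = occurrences(text, d1)
    occ2 = occurrences(text, d2)
    inside = 0
    for s1 in occ1:
        if any(s2 <= s1 and s1 + len(d1) <= s2 + len(d2) for s2 in occ2):
            inside += 1
    return inside / len(occ1)


def approx_p(text, d1, d2):
    """The approximation rf(D1, D2) * f(D2) / f(D1)."""
    f1 = len(occurrences(text, d1))   # f(D1)
    f2 = len(occurrences(text, d2))   # f(D2)
    rf = len(occurrences(d2, d1))     # occurrences of D1 inside D2
    return rf * f2 / f1


S = "abcicdefcdegabchabcde"
print(exact_p(S, "c", "abc"), approx_p(S, "c", "abc"))
print(exact_p("aaaa", "a", "aa"), approx_p("aaaa", "a", "aa"))
```

For S, the five occurrences of c include three that lie inside occurrences of abc, and the two quantities agree. In the string aaaa, a single occurrence of a lies inside two overlapping occurrences of aa, which is exactly the situation in which the approximation overcounts.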
7 Experimental Results

Algorithms B (Section 3.2), C (Section 3.3), and D (Section 3.4) were programmed in GNU C++ and run on a SUN SPARCstation 1. For test data we used 120 randomly generated strings. The alphabet size was chosen to be one of {5, 15, 25, 35} and the string length was 500, 1000, or 2000. The test set of strings consisted of 10 different strings for each of the 12 possible combinations of input size and alphabet size. For each of these combinations, the average run times for the 10 strings are given in Figures 15-18.

Figure 15 gives the average times for computing all conflicts by combining algorithms B and C. Figure 16 gives the average times for computing all prefix-suffix conflicts using Algorithm C. Figure 17 gives the average times for computing all the pattern restricted prefix-suffix conflicts (problem P7 of Section 5) by modifying Algorithm C as described in Section 5. Figure 18 represents the average times for Algorithm D. Figures 15 to 17 represent the theoretically superior solutions to the corresponding problems, while Figure 18 represents Algorithm D, which provides a simpler, but suboptimal, solution to the three problems. In all cases, the times for constructing scdawgs and writing the results to a file were not included, as these steps are common to all the solutions.

Size of        Size of String
Alphabet      500     1000     2000
    5         410      989     2722
   15         292      603     1300
   25         315      671     1485
   35         234      791     1740

Figure 15: Time in ms for computing all conflicts using the optimal algorithm

The results show that the suboptimal Algorithm D is superior to the optimal solution for computing all conflicts or all prefix-suffix conflicts for a randomly generated string. This is due to the simplicity of Algorithm D and the fact that the number of conflicts in a randomly generated string is small. However, on a string such as a^100, which represents the worst case scenario in terms of the number of conflicts reported, the following run times were obtained:

All conflicts, optimal algorithm: 14,190 ms
All prefix-suffix conflicts, optimal algorithm: 10,840 ms
All pattern restricted prefix-suffix conflicts, optimal algorithm: 5,000 ms
Algorithm D: 26,942 ms

The experimental results using random strings also show that, as expected, the optimal algorithm fares better than Algorithm D for the more restricted problem of computing pattern oriented prefix-suffix conflicts. We conclude that Algorithm D should be used for the more general problems of computing conflicts, while the optimal solutions should be used for the restricted versions. Hence, Algorithm D should be used in an automatic environment, while the optimal solutions should be used in interactive or semi-automatic environments.
Size of        Size of String
Alphabet      500     1000     2000
    5         247      730     1873
   15         219      454      989
   25         255      522     1179
   35         186      648     1370

Figure 16: Time in ms for computing all prefix-suffix conflicts using the optimal algorithm

Size of        Size of String
Alphabet      500     1000     2000
    5         163      399     1058
   15         103      231      550
   25          91      267      628
   35          61      226      735

Figure 17: Time in ms for computing all pattern restricted prefix-suffix conflicts using the optimal algorithm

Size of        Size of String
Alphabet      500     1000     2000
    5         203      551     1367
   15         217      409      897
   25         227      400      887
   35         145      478      994

Figure 18: Time in ms for algorithm D
8 Conclusions

In this paper, we have described efficient algorithms for the analysis and visualization of patterns in strings. We are currently extending these to other discrete objects such as circular strings and graphs. Extending these techniques to the domain of approximate string matching would be useful, but appears to be difficult.
References

[1] D. Mehta and S. Sahni, "String Visualization," in preparation, 1991.
[2] B. Clift, D. Haussler, T.D. Schneider, and G.D. Stormo, "Sequence Landscapes," Nucleic Acids Research, vol. 14, no. 1, pp. 141-158, 1986.
[3] G.M. Morris, "The Matching of Protein Sequences using Color Intrasequence Homology Displays," J. Mol. Graphics, vol. 6, pp. 135-142, 1988.
[4] A. Blumer, J. Blumer, D. Haussler, R. McConnell, and A. Ehrenfeucht, "Complete Inverted Files for Efficient Text Retrieval and Analysis," J. ACM, vol. 34, no. 3, pp. 578-595, 1987.
[5] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M.T. Chen, and J. Seiferas, "The Smallest Automaton Recognizing the Subwords of a Text," Theoretical Computer Science, no. 40, pp. 31-55, 1985.
[6] M.E. Majster and A. Reiser, "Efficient on-line construction and correction of position trees," SIAM Journal on Computing, vol. 9, pp. 785-807, Nov. 1980.
[7] E. McCreight, "A space-economical suffix tree construction algorithm," Journal of the ACM, vol. 23, pp. 262-272, Apr. 1976.
[8] M.T. Chen and J. Seiferas, "Efficient and elegant subword tree construction," in Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI Series, Vol. F12, pp. 97-107, Berlin Heidelberg: Springer-Verlag, 1985.
[9] E. Horowitz and S. Sahni, Fundamentals of Data Structures in Pascal, 3rd Edition. Computer Science Press, 1990.