Streaming k-mismatch with data recovery and applications

Jakub Radoszewski^1 and Tatiana Starikovskaya^2

arXiv:1607.05626v1 [cs.DS] 19 Jul 2016

^1 King's College London, Department of Informatics, London, U.K., [email protected]
^2 University of Bristol, Department of Computer Science, Bristol, U.K., [email protected]

Abstract. We revisit the k-Mismatch problem, one of the most basic problems in pattern matching. Given a pattern of length m and a text of length n, the task is to find all m-length substrings of the text that are at Hamming distance at most k from the pattern. Our first contribution is two new streaming algorithms for the k-Mismatch problem.

• For k = 1, we give an O(polylog m)-space algorithm that uses O(polylog m) worst-case time per arriving symbol;

• For k ≥ 2, the problem can be solved in O(k^2 polylog m) space and O(k polylog m) worst-case time per arriving symbol.

The complexities of our new k-Mismatch algorithms match (up to logarithmic factors) those of the best known solutions from FOCS 2009 and SODA 2016, and our algorithms are enhanced with an important new feature which we call Data Recovery: for each alignment where the Hamming distance is at most k, the algorithms output the difference of symbols of the pattern and of the text at the mismatching positions. This is a particularly surprising feature as we are not allowed to store a copy of the pattern or the text in the streaming setting.

Armed with this new feature, we develop the first streaming algorithms for pattern matching on weighted strings. Unlike regular strings, a weighted string is not a sequence of symbols but rather a sequence of probability distributions on the alphabet. In the Weighted Pattern Matching problem we are given a text and a pattern, both of which can be either weighted or regular strings, and must find all alignments of the text and of the pattern where they match with probability above a given threshold 1/z.

• When only the pattern is weighted, we show an O(z log m)-space algorithm that takes O(log log(z + m)) worst-case time per symbol;

• For the case when only the text is weighted, we give a (1 − ε)-approximation algorithm which uses O(z log z / log(1/(1−ε)) + polylog(z, m)) space and runs in O(z log z + polylog(z, m)) worst-case time per symbol;

• Finally, we combine our solutions to develop a (1 − ε)-approximation algorithm for the case when both the pattern and the text are weighted. The algorithm requires O(z log z / log(1/(1−ε)) + z · polylog(z, m)) space and O(z · polylog(z, m)) worst-case time per arriving symbol.

The algorithms presented in the paper are randomised and correct with high probability.

1 Introduction

In this work we design efficient streaming algorithms for a number of approximate pattern matching problems, including the famous k-Mismatch problem. In this class of problems we are given a pattern and a text and wish to find all substrings of the text that are "similar" to the pattern. In the streaming model it is assumed that the text arrives as a stream, one symbol at a time. This model of computation was introduced for the purpose of processing massive data. In this setting the space usage must be kept to a minimum, and therefore we cannot even afford to store a copy of the text or of the pattern. We also require the algorithms to compute the answer for the current position before the next symbol arrives.

The first small-space streaming algorithms for pattern matching were suggested in the pioneering paper of Porat and Porat at FOCS 2009 [18]. In particular, they showed an algorithm for the Pattern Matching problem, where one must find all exact occurrences of the pattern in the text. For a pattern of length m their algorithm takes only O(log m) space and O(log m) time per symbol of the text, and reports all exact occurrences of the pattern as they occur. In 2010, Ergun et al. presented a slightly simpler version of the algorithm [13], and finally in 2011 the running time was improved to constant [6].

The next logical step was to study the complexity of approximate pattern matching in the streaming model, in which we are to find all substrings of the text that are at a small distance from the pattern. The most popular distances are the Hamming distance, the L1, L2, and L∞ distances, and the edit distance. Unfortunately, the general task of computing these distances exactly for each alignment of the pattern and of the text requires at least Ω(m) space. This lower bound holds even for randomised algorithms [8]. However, this is not the case if we allow approximation or if it suffices to compute the exact distances only when they are small.

Here we focus on approximate pattern matching under the Hamming distance. If we are interested in computing a (1 + ε)-approximation of the Hamming distance for all alignments of the pattern and of the text, then this can be done in O(ε^{-5} √m log^4 m) space and O(ε^{-4} log^3 m) time per arriving symbol [9]. And if we are only interested in computing the Hamming distances at the alignments where they do not exceed a given threshold k, then Porat and Porat showed that there is a randomised streaming algorithm that solves this problem in O(k^3 log^7 m / log log m) space and O(k^2 log^5 m / log log m) time per arriving symbol; they also presented an algorithm for the special case of k = 1 with O(log^4 m / log log m) space and O(log^3 m / log log m) time per arriving symbol [18]. Recently, their result was improved (in terms of the dependency on k) by Clifford et al. to O(k^2 log^{11} m / log log m) space and O(√(k log k) + log^5 m) time per arriving symbol [10]. The problem of computing the Hamming distances at all alignments where they are at most k is called the k-Mismatch problem.

Our first contribution is two new streaming algorithms for the k-Mismatch problem. The crucial feature of our algorithms is that, for each alignment where the Hamming distance is at most k, they can also output the differences of symbols of the pattern and of the text at the mismatching positions. This is particularly surprising as we are not allowed to store a copy of the pattern or of the text in the streaming setting.
The k-Mismatch problem extended with computing this additional information is called here the k-Mismatch with Data Recovery problem. We first develop a solution for k = 1.

Theorem 1.1. For a given pattern P of length m, and a streaming text T of length n arriving one symbol at a time, there is a randomised algorithm that takes O(log^5 m / log log m) space and O(log^5 m / log log m) worst-case time per arriving symbol and solves the 1-Mismatch with Data Recovery problem. The probability of error is at most 1/poly(n).

Next we derive a k-Mismatch with Data Recovery algorithm for arbitrary k via an existing randomised reduction to the 1-Mismatch with Data Recovery problem [18, 10].

Theorem 1.2. For a given pattern P of length m, and a streaming text T of length n arriving one symbol at a time, there is a randomised algorithm that takes O(k^2 log^{10} m / log log m) space and O(k log^8 m / log log m) time per symbol and solves the k-Mismatch with Data Recovery problem. The probability of error is at most 1/poly(m).

Since in the k-Mismatch with Data Recovery problem we need at least Ω(k) time per arrival to list the symbol differences, the running time of our algorithm is almost optimal. The time complexity of our k-Mismatch with Data Recovery algorithm is better than that of the k-Mismatch algorithm of [18] and worse than that of [10]. Our algorithm also uses less space than the k-Mismatch algorithm of [18] (in terms of k) and the k-Mismatch algorithm of [10]. For the former, this is explained by the fact that we use a more efficient reduction; for the latter, it is because we do not need the elaborate time-saving techniques of Clifford et al. [10] (as the running time of our algorithm is already almost optimal).

The Data Recovery feature is an extremely powerful tool. We demonstrate this by developing the first streaming algorithms for the problem of pattern matching on weighted strings, which is our second contribution. A weighted string (also known as a weighted sequence, position probability matrix, or position weight matrix, PWM) is a sequence of distributions on the alphabet. Weighted strings are a commonly used representation of uncertain sequences in molecular biology. In particular, they are used to model motifs and have become an important part of many software tools for computational motif discovery, see e.g. [19, 20].

In the Weighted Pattern Matching problem we are given a text and a pattern, both of which can be either weighted or regular strings. If only the text or only the pattern is weighted, the task is to find all alignments of the text and of the pattern where they match with probability above a given threshold 1/z. In the most general case, when both the pattern and the text are weighted, we must find all alignments of the text and of the pattern where there exists a regular string that matches both the text and the pattern with probability at least 1/z.

As already mentioned, we are the first to consider the Weighted Pattern Matching problem in the streaming setting. In the offline setting the most commonly studied variant, where the text is a weighted string of length n and the pattern is a regular string, can be solved in O(n log n) time via the Fast Fourier Transform [7] or in O(n log z) time using the suffix array and lookahead scoring [16]. This variant has also been considered in the indexing setting, in which we are to preprocess a weighted text to be able to answer queries for multiple patterns; see [1, 14, 5, 3]. The symmetric variant of the Weighted Pattern Matching problem, where only the pattern is weighted, is closely related to the problem of profile matching [17] and admits O(n log n)-time and O(n log z)-time solutions as well. Finally, the variant where both the text and the pattern are weighted was introduced in [4], where an O(nz^2 log z)-time solution was presented. Later a more efficient O(n √(z log log z))-time solution was devised in [16]. The offline algorithms use Ω(m) space, and the best indexing solution uses Ω(nz) space. The problem of computing Hamming and edit distances for weighted strings has also been considered [2].
Our key result for the Weighted Pattern Matching problem is a (1 − ε)-approximation streaming algorithm for the case when the pattern is a regular string and the text is weighted. At each alignment the algorithm outputs either "Yes" or "No". If the pattern matches the corresponding fragment of the text with probability at least 1/z, the algorithm outputs "Yes". If the match probability is between (1 − ε)/z and 1/z, it can output either "Yes" or "No", and otherwise it outputs "No". Whenever it outputs "Yes", it also outputs a (1 − ε)-approximation of the match probability between P and T. Our solution combines the new streaming algorithm for the k-Mismatch with Data Recovery problem with new techniques for approximating the product of probabilities in a sliding window.

Theorem 1.3. For a pattern P of length m, and a weighted text T arriving one position at a time, there is a randomised (1 − ε)-approximation algorithm that takes O(z log z / log(1/(1−ε)) + log^2 z · log^{10} m / log log m) space and O(z log z + log z · log^8 m / log log m) time per symbol and solves the Weighted Pattern Matching problem. The error probability is at most 1/poly(m).

The case when the pattern is weighted is the simplest of the three, and in Section 2 we show that it can be solved via the streaming algorithm for the Dictionary Matching problem. In the Dictionary Matching problem we are given a set of patterns called a dictionary and must find all occurrences of the patterns in the text.

Fact 1.1. For a weighted pattern P of length m, and a text T arriving one symbol at a time, there is a randomised algorithm that takes O(z log m) space and O(log log(z + m)) time per symbol and solves the Weighted Pattern Matching problem. The error probability is at most 1/poly(n).

Our final result is a (1 − ε)-approximation streaming algorithm for the most general case when both the pattern and the text are weighted. At each alignment the algorithm outputs either "Yes" or "No". If there is a regular string that matches the text and the pattern at this alignment with probability above 1/z, the algorithm outputs "Yes". It may also output "Yes" if there is a string that matches with probability between (1 − ε)/z and 1/z. Otherwise it outputs "No".

Theorem 1.4. For a weighted pattern P of length m, and a weighted text T arriving one position at a time, there is a randomised (1 − ε)-approximation algorithm that takes O(z log z / log(1/(1−ε)) + z log^2 z · log^{10} m / log log m) space and O(z log z · log^8 m / log log m) time per symbol and solves the Weighted Pattern Matching problem. The error probability is at most 1/poly(m).

For problems on weighted sequences, we assume the word RAM model with word size w = Ω(log n + log z). We consider the log-probability model of representation of weighted sequences, that is, we assume that the probabilities in the weighted sequences and the threshold probability 1/z are all of the form cp/2^{dw}, where c and d are constants and p is an integer that fits in a constant number of machine words. Additionally, the probability 0 has a special representation. The only operations on probabilities in our algorithms are multiplications and divisions, which can be performed exactly in O(1) time in this model. We also assume a constant-sized alphabet Σ. (This assumption may be dropped at the cost of an additional |Σ| factor in the time complexity.)
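For intuition, the following minimal Python sketch (our own illustration; the constants W and D below are assumptions, not taken from the paper) shows why multiplications and divisions stay exact in this representation:

```python
from fractions import Fraction

W = 64   # assumed word size w
D = 2    # the constant d in the c*p/2^(dw) representation (illustrative)

def prob(p, c=1):
    """A probability of the form c*p / 2^(dw), stored exactly."""
    return Fraction(c * p, 1 << (D * W))

# Multiplying or dividing two such values only touches integer numerators
# and power-of-two denominators, so the result is again exact:
a = prob(3)                        # 3 / 2^128
b = prob(1 << (D * W - 1))         # exactly 1/2
assert a * b == Fraction(3, 1 << (D * W + 1))
```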

2 Overview of the main ideas

In this section we first give an overview of the main technical ideas needed to prove Theorems 1.1 and 1.2. We then prove Fact 1.1 and outline the proofs of Theorems 1.3 and 1.4. Recall that a (regular) string X of length |X| = n is a sequence of symbols X[1], . . . , X[n] from an alphabet Σ. By X[i, j], for 1 ≤ i ≤ j ≤ n, we denote the string X[i] . . . X[j], called a substring of X. If i = 1, the substring is called a prefix of X, and if j = n, it is called a suffix of X.

2.1 Streaming k-Mismatch with Data Recovery

1-Mismatch with Data Recovery (Theorem 1.1). We start by introducing the notion of 1-mismatch sketches. Our definition exploits the Karp-Rabin fingerprint, a rolling hash function ([15]; see also Appendix A).

Definition 2.1 (Karp-Rabin fingerprint). Let p be a prime and r a random integer in F_p. We define the Karp-Rabin fingerprint of a string X = X[1] . . . X[ℓ] as φ(X) = (Σ_{i=1}^{ℓ} X[i] · r^i) mod p.
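For concreteness, here is a minimal Python sketch of Definition 2.1 together with the composition rules of Lemma A.1 (the concrete prime P below is our own illustrative choice, not prescribed by the paper):

```python
import random

P = (1 << 61) - 1           # an illustrative prime; the paper needs p > m * f(m)
R = random.randrange(1, P)  # the random base r of Definition 2.1

def fingerprint(s):
    """phi(X) = (sum_{i=1..l} X[i] * r^i) mod p."""
    phi, rpow = 0, R
    for c in s:
        phi = (phi + ord(c) * rpow) % P
        rpow = rpow * R % P
    return phi

def extend(phi_x, len_x, phi_y):
    """Lemma A.1: phi(XY) = (phi(X) + r^{|X|} * phi(Y)) mod p."""
    return (phi_x + pow(R, len_x, P) * phi_y) % P

def window_fp(prefix_fps, i, j):
    """phi(text[i+1..j]) from the prefix fingerprints phi(text[1..i]) and
    phi(text[1..j]); this is how the algorithms below extract the fingerprint
    of a window from stored prefix fingerprints."""
    return (prefix_fps[j] - prefix_fps[i]) * pow(R, -i, P) % P
```

Two distinct equal-length strings collide on their fingerprints with probability O(ℓ/p) over the choice of r, which is where the requirement p > m · f(m) in the property below comes from.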

The most important property of Karp-Rabin fingerprints is that for any two strings X and Y of length ℓ ≤ m such that X ≠ Y, the probability that their fingerprints are equal is at most 1/f(m) provided that p > m · f(m).

Definition 2.2 (1-mismatch sketch). For a prime q, the 1-mismatch sketch of a string X is a vector of length q, where the j-th element is the Karp-Rabin fingerprint of the subsequence X[j]X[j + q]X[j + 2q] . . .

Our definition of 1-mismatch sketches was inspired by the paper of Porat and Porat [18]. Let X, Y be two m-length strings and Ham(X, Y) be the Hamming distance between them. Porat and Porat noticed that Ham(X, Y) = 1 if and only if for each prime q ∈ [log m, 2 log m] exactly one pair of subsequences X[j]X[j + q]X[j + 2q] . . . and Y[j]Y[j + q]Y[j + 2q] . . . mismatches, and they used this property to develop their 1-Mismatch algorithm. They also noticed that, by the Chinese remainder theorem, the mismatching subsequences uniquely define the mismatch position between X and Y. We use the 1-mismatch sketches both to detect 1-mismatch occurrences of the pattern and to restore the data.

Unfortunately, the 1-mismatch algorithm of Porat and Porat [18] cannot compute the 1-mismatch sketches. This is because it discards all the information about the mismatching subsequences as soon as it discovers a mismatch. Our algorithm cannot compute the 1-mismatch sketches either, at least not in the general case. The main new idea is to reduce the 1-Mismatch with Data Recovery problem to O(log m) instances of a special case of this problem where the mismatch is required to belong to the second half of the pattern. More formally, consider ⌈log m⌉ + 1 partitions P = P_i S_i, where P_i is a prefix of length min{2^i, m} and S_i is the remaining suffix, for i = 0, 1, . . . , ⌈log m⌉. We say that a substring of T is a right-half 1-mismatch occurrence of P_i if either i = 0 and P_0 does not match, or 1 ≤ i < ⌈log m⌉ and the mismatch belongs to P_i[2^{i-1} + 1, 2^i], or i = ⌈log m⌉ and the mismatch belongs to P_{⌈log m⌉}[2^{⌈log m⌉-1} + 1, m] = P[2^{⌈log m⌉-1} + 1, m].

Observation 2.3. Any 1-mismatch occurrence of P in T is a right-half 1-mismatch occurrence of some P_i followed by an exact occurrence of S_i.

We run two parallel processes for each i. The first process will be looking for right-half 1-mismatch occurrences of P_i. To detect the occurrences it uses the 1-Mismatch algorithm of Porat and Porat [18], and to compute the 1-mismatch sketches, it uses a streaming Pattern Matching algorithm for P_i that takes O(log m) space and O(log m) time per symbol (see for example [6]). This algorithm is not the fastest solution known, but it gives us the desired time bounds and is conceptually simple. The algorithm tests each position of T in O(log |P_i|) = O(log m) levels. At level j it checks, for all positions in the suffix of length 2^{j+1} of the current text, whether they are occurrences of P_i[1, 2^j] and, if not, discards them. For us it is important that the algorithm stores the Karp-Rabin fingerprint of the current text and the Karp-Rabin fingerprints of T from the beginning of T up to each of the positions that have not been discarded yet. In particular, this means that upon reaching the last position of a right-half 1-mismatch occurrence of P_i, the algorithm can retrieve its Karp-Rabin fingerprint in constant time.
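A small Python sketch of Definition 2.2 and of the Chinese-remainder recovery of the mismatch position (our own illustration, reusing fingerprint from the sketch above; positions and residue classes here are 0-based, whereas the paper indexes from 1):

```python
from math import log2

def one_mismatch_sketch(s, q):
    """Definition 2.2 (0-based): class j holds the fingerprint of
    the subsequence s[j], s[j+q], s[j+2q], ..."""
    return [fingerprint(s[j::q]) for j in range(q)]

def primes_in(lo, hi):
    """Primes in [lo, hi] by trial division; fine for q = Theta(log m)."""
    return [n for n in range(max(2, lo), hi + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]

def mismatch_position(x, y):
    """If Ham(x, y) = 1, recover the mismatch position: for each prime
    q in [log m, 2 log m] exactly one residue class mismatches (w.h.p.),
    and the product of these primes exceeds m, so by the Chinese
    remainder theorem the congruences pin the position down uniquely."""
    m = len(x)
    residues = []
    for q in primes_in(int(log2(m)), 2 * int(log2(m))):
        sx, sy = one_mismatch_sketch(x, q), one_mismatch_sketch(y, q)
        bad = [j for j in range(q) if sx[j] != sy[j]]
        assert len(bad) == 1       # holds when Ham(x, y) = 1, w.h.p.
        residues.append((q, bad[0]))
    return next(i for i in range(m) if all(i % q == j for q, j in residues))

assert mismatch_position("abcdefgh", "abcXefgh") == 3
```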
We use the fingerprint to compute the difference of symbols at the mismatching positions of P and T and then encode all the data as 1-mismatch sketches for the primes q ∈ [log m, 2 log m] via the following lemma.

Lemma 2.4. Assume that X and Y are two strings of length m that differ only at position j. Knowing the Karp-Rabin fingerprints of X and Y, one can compute X[j] − Y[j] in O(log m) time.

Proof. Let φ(X) and φ(Y) be the Karp-Rabin fingerprints of X and Y, respectively. Then X[j] − Y[j] is equal to (φ(X) − φ(Y)) · r^{-j} (mod p), where p and r are the integers used in the definition of Karp-Rabin fingerprints. This formula can be evaluated in O(log m) time, the bottleneck being the computation of r^{-j} mod p by fast exponentiation.
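In code, Lemma 2.4 is a one-liner on top of the fingerprint sketch above (positions are 1-based here, matching the definition of φ):

```python
def symbol_difference(phi_x, phi_y, j):
    """Lemma 2.4: if X and Y differ only at position j (1-based), then
    X[j] - Y[j] = (phi(X) - phi(Y)) * r^(-j) mod p."""
    return (phi_x - phi_y) * pow(R, -j, P) % P

# e.g. recovering ord('c') - ord('X') (mod p) from the two fingerprints alone:
d = symbol_difference(fingerprint("abcde"), fingerprint("abXde"), 3)
assert d == (ord("c") - ord("X")) % P
```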

After having detected a right-half 1-mismatch occurrence of P_i, we send its 1-mismatch sketches for all primes q ∈ [log m, 2 log m] (hence, of total size O(log^2 m / log log m)) to the second process, which verifies whether it is followed by an exact occurrence of S_i. Note that 1-mismatch sketches can be used both to find the position of the mismatch and to recover the data if we apply Lemma 2.4 to individual subsequences.

The second process is also based on the Pattern Matching algorithm. The main technical hurdle is how to store the 1-mismatch sketches. We cannot store them for each position, as this would require too much space. We claim that it is enough to store the 1-mismatch sketches only for a constant number of positions per level. Using this data, we will be able to retrieve the 1-mismatch sketches for all other positions (Lemma 3.1). This finally gives us the 1-Mismatch with Data Recovery algorithm. We describe and analyse the algorithm in full detail in Section 3.

k-Mismatch with Data Recovery (Theorem 1.2). Theorem 1.2 follows from Theorem 1.1 via a randomised reduction. The first variant of this reduction, which was deterministic, was presented by Porat and Porat [18], who used it to reduce the k-Mismatch problem to the 1-Mismatch problem. It was later made more space-efficient, in return for a slightly higher error probability, by Clifford et al. [10]. The main idea of this reduction is to consider a number of partitions of the pattern into O(k log^2 m) subpatterns. By defining the partitions appropriately we can guarantee that, at each alignment where the Hamming distance is small, each mismatch will correspond to a 1-mismatch occurrence of some subpattern. This lets us apply the 1-Mismatch with Data Recovery algorithm to find all such alignments and to restore the data for them. We give a full description of the reduction in Appendix B.

2.2 Weighted pattern matching

We consider three cases depending on whether the pattern and the text are weighted or not. We start with a proof of Fact 1.1, which gives a streaming algorithm for the simplest variant of Weighted Pattern Matching, where the pattern is weighted and the text is a regular string. Afterwards we give an overview of the main new ideas we use to show Theorems 1.3 and 1.4.

Only the pattern is weighted (Fact 1.1). At the preprocessing step we generate all (regular) strings that match the pattern with probability at least 1/z. We do this in a naive fashion by scanning the pattern from left to right. Note that there can be at most z of them, as the sum of match probabilities over all strings is 1. If a substring of the text T matches P, it must be an occurrence of one of the generated strings. We find all such occurrences using the streaming Dictionary Matching algorithm by Clifford et al. [11] for the dictionary of the z generated strings. For a dictionary of z strings of length m the algorithm takes O(z log m) space and O(log log(z + m)) time per symbol. The algorithm is randomised and has error probability 1/poly(n). Thus we arrive at the conclusion of Fact 1.1. (A code sketch of the generation step appears at the end of Section 5.)

Only the text is weighted (Theorem 1.3). Our proof of Theorem 1.3 is based on several combinatorial observations. For a weighted string W, by H(W) we denote the regular string obtained from W by choosing at each position the symbol with the maximum probability. The first observation is that at each alignment where P and T match with probability at least 1/z, the number of mismatches between P and H(T) is small. To find such alignments, we use our k-mismatch algorithm. The second observation is that at each moment there are only a few left-maximal regular strings that match a suffix of T above the threshold. Moreover, for each of these strings the Hamming distance between them and H(T) must be small as well. We maintain such strings as streams, and index them with the mismatches and symbol differences between them and H(T). If an m-length substring of T matches P with probability at least 1/z, it is a suffix of a left-maximal string S. We will use the mismatches and the symbol differences, which are output by our k-mismatch algorithm, to find S.

As a final step, we need to approximate the match probability for the m-length suffix of S. In order to do this, we introduce the Sliding Window Product problem. In the Sliding Window Product problem we are given a stream of numbers from [0, 1] arriving one number at a time, an integer m, and a threshold 1/z, and must find the product of the numbers in each window of length m provided that it is above the threshold. We give a (1 − ε)-approximation algorithm for this problem. More precisely, when a new symbol arrives, the algorithm outputs either a number or "No". If it outputs a number, it is a (1 − ε)-approximation of the product of the last m numbers, and otherwise the product is at most (1 − ε)/z.

Lemma 2.5. For a stream of numbers arriving one number at a time, and a window width m, there is a deterministic (1 − ε)-approximation algorithm that takes O(log z / log(1/(1−ε))) space and O(1) time per arrival and solves the Sliding Window Product problem.

The idea is to maintain a small number of non-overlapping intervals that at each moment contain the m′ most recent numbers with a total product of at most (1 − ε)/z. We guarantee that only the first interval can contain numbers outside of the current window and that each interval is either a singleton or has a sufficiently large product of numbers (at least 1 − ε). If m′ < m, then the product of the last m numbers is at most (1 − ε)/z and the algorithm outputs "No". Otherwise, the algorithm outputs the product of the numbers in all intervals and, since the product of the numbers in the first interval is large if the interval is not a singleton, the result is a good approximation. We give a full proof of Theorem 1.3 in Section 4 and a full proof of Lemma 2.5 in Appendix D.

Weighted pattern and weighted text (Theorem 1.4). Our solution of the Weighted Pattern Matching problem when both the pattern and the text are weighted combines the techniques we used to show Fact 1.1 and Theorem 1.3. Similarly to Fact 1.1, we start by generating the at most z regular strings that match the pattern. Note that running a separate instance of the algorithm of Theorem 1.3 for each of them would not give us the desired space bound. Instead, we run just one instance of that algorithm but organise the mismatch information differently. As a result, we can find a stream that ends with a string generated at the preprocessing step in just O(log z) time. We elaborate on this in Section 5.

3 Proof of Theorem 1.1: 1-mismatch with data recovery

In this section we give our streaming 1-Mismatch with Data Recovery algorithm. Recall that for each i = 0, 1, . . . , ⌈log m⌉ we consider a partition P = P_i S_i, where P_i is a prefix of length min{2^i, m} and S_i is the remaining suffix, and run two processes in parallel. The first process looks for right-half 1-mismatch occurrences of P_i in T. When such an occurrence is identified, the algorithm sends it to the second process, which finds exact occurrences of S_i and checks which of them are preceded by right-half 1-mismatch occurrences of P_i.

We start with an outline of a streaming Pattern Matching algorithm that lies at the foundation of both processes. For a pattern Q and text T the algorithm takes O(log |Q|) space and O(log |Q|) time per symbol. This algorithm is not the best known solution, but it gives the desired time bounds and is conceptually simple. The algorithm stores O(log |Q|) levels of positions of T. Positions in level j are occurrences of Q[1, 2^j] in the suffix of the current text T of length 2^{j+1}.

The algorithm stores the Karp-Rabin fingerprints of T from the beginning of T up to each of these positions. If there are at least 3 such positions at one level, then there is a large overlap of the occurrences of Q[1, 2^j] and therefore all the positions form a single arithmetic progression whose difference equals the length of the minimal period of Q[1, 2^j]. This lets us store the aforementioned information very compactly, using only O(log |Q|) space in total. Finally, the algorithm stores the Karp-Rabin fingerprint of the current text and of all prefixes Q[1, 2^j]. When a new symbol of T arrives, the algorithm considers the leftmost position in each level j. If the fingerprints imply that it is an occurrence of Q[1, 2^{j+1}], the algorithm promotes it to the next level. When a position reaches the top level, it is an occurrence of Q and the algorithm outputs it. For more details see for example [6].
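The following Python sketch captures the doubling-level structure just described, but in a deliberately simplified form: candidate positions and prefix fingerprints are stored explicitly rather than compressed into arithmetic progressions, so it uses more than the O(log |Q|) space of the real algorithm (it reuses fingerprint and window_fp from the sketch in Section 2.1):

```python
def stream_match(pattern, text):
    """Report all occurrences of `pattern` in `text`, re-verifying each
    candidate start at prefix lengths 1, 2, 4, ..., m via fingerprints."""
    m = len(pattern)
    checks = sorted({min(2 ** j, m) for j in range(m.bit_length() + 1)})
    pat_fp = {L: fingerprint(pattern[:L]) for L in checks}
    prefix = {0: 0}        # prefix[n] = fingerprint of text[1..n]
    live = {}              # candidate start (0-based) -> index into `checks`
    fp, rpow, occurrences = 0, R, []
    for n, c in enumerate(text, start=1):
        fp = (fp + ord(c) * rpow) % P
        rpow = rpow * R % P
        prefix[n] = fp
        live[n - 1] = 0    # every position starts as a level-0 candidate
        for i in list(live):
            L = checks[live[i]]
            if n - i == L:                       # time to verify the next level
                if window_fp(prefix, i, n) != pat_fp[L]:
                    del live[i]                  # discard the candidate
                elif L == m:
                    occurrences.append(i)        # survived the top level
                    del live[i]
                else:
                    live[i] += 1                 # promote to the next level
    return occurrences

# e.g. stream_match("aba", "abababa") == [0, 2, 4] (w.h.p. over the choice of r)
```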

3.1 1-mismatch occurrences of P_i

Let us now describe the first process. Suppose a new symbol T[q] has arrived. If T[q − |P_i| + 1, q] is a right-half 1-mismatch occurrence of P_i, the process will return the 1-mismatch sketches of T[q − |P_i| + 1, q].

To check whether T[q − |P_i| + 1, q] is a right-half 1-mismatch occurrence of P_i, we use the 1-Mismatch algorithm [18], which requires O(log^4 m / log log m) space and O(log^3 m / log log m) time per symbol. This algorithm also returns the mismatch position. However, the algorithm knows neither the difference of symbols at the mismatch position nor the 1-mismatch sketches of T[q − |P_i| + 1, q]. To extend the algorithm with the required functionality, we use the Pattern Matching algorithm for the pattern Q = P_i and T. If T[q − |P_i| + 1, q] is a right-half 1-mismatch occurrence of P_i, then P_i[1, 2^{i-1}] matches at the position q − |P_i| + 1 exactly. It follows that at time q the position q − |P_i| + 1 is stored at level i − 1 of the Pattern Matching algorithm, and we know the Karp-Rabin fingerprints of T[1, q − |P_i|] and T[1, q]. Therefore, we can compute the Karp-Rabin fingerprint of T[q − |P_i| + 1, q] in O(1) time. Let j be the mismatch position and φ(P_i), φ(T[q − |P_i| + 1, q]) be the Karp-Rabin fingerprints of P_i and T[q − |P_i| + 1, q], respectively. We can then use Lemma 2.4 to compute the difference of symbols of P_i and T at the position j.

As a final step, we compute the 1-mismatch sketches of T[q − |P_i| + 1, q] and send them to the second process. We assume that during the preprocessing step the algorithm precomputes the 1-mismatch sketches of P_i for all primes in [log m, 2 log m], which requires O(log^2 m / log log m) space. It can then restore the 1-mismatch sketches of T[q − |P_i| + 1, q] in O(log^2 m / log log m) time by fixing the symbol at the mismatch position.

3.2 Exact occurrences of S_i

We now describe the second process, which verifies whether the right-half 1-mismatch occurrences of P_i found by the first process are followed by an exact occurrence of S_i. We build this process on top of the Pattern Matching algorithm for the pattern Q = S_i and T. Since for each new position q the first process tells us whether it is preceded by a right-half 1-mismatch occurrence of P_i, all we need is to carry this information from level 0 of the Pattern Matching algorithm to the top level. However, this is not an easy task, because the Pattern Matching algorithm stores the levels in a compressed form. We claim that it suffices to store the 1-mismatch sketches for a constant number of positions in each level. Using them, we will be able to infer the remaining unstored information.

Consider level j. The progression that the algorithm currently stores for this level can be a part of a longer progression of occurrences of S_i[1, 2^j] in T. Let ℓ be some occurrence of S_i[1, 2^j] for which we would like to figure out whether it is preceded by a right-half 1-mismatch occurrence of P_i. If ℓ is far from the start of the long progression, then the text preceding ℓ is periodic with period S_i[1, ρ_j] and we can use this fact to infer the 1-mismatch information. So our main concern is the positions ℓ that are at distance at most |P_i| = 2^i from the start of the long progression. We define four positions ℓ_j^a, a = 1, 2, 3, 4, that will help us restore the information in this case. Note that we can easily determine the moment when a new long progression starts, as this is precisely the moment when the difference between two consecutive positions in level j becomes larger than ρ_j. We define ℓ_j^1 as the first position preceded by a right-half 1-mismatch occurrence of P_i that was added to level j since that moment. We further define ℓ_j^a as the leftmost term preceded by a right-half 1-mismatch occurrence of P_i in [ℓ_j^1 + (a − 1) · 2^{i-2}, ℓ_j^1 + a · 2^{i-2}] for a = 2, 3, 4 (if such terms exist; otherwise they are left undefined). The algorithm stores the 1-mismatch sketches of T[1, ℓ_j^a − 2^i − 1] for each a = 1, 2, 3, 4 and the 1-mismatch sketches of T[1, ℓ_j^1 − 1].

During the preprocessing step the algorithm computes a number of 1-mismatch sketches for the pattern, which in total occupy O(log^3 m / log log m) space. First, it stores the 1-mismatch sketches of the minimal period of P_i[1, 2^{i-1}]. Second, the algorithm stores the 1-mismatch sketches of the minimal period S_i[1, ρ_j] of S_i[1, 2^j] for each j = 0, 1, . . . , ⌊log |S_i|⌋. Finally, it stores the 1-mismatch sketches of S_i[1, δ_j], where δ_j is the remainder of −2^i modulo ρ_j. In Appendix C we show the following technical lemma.

Lemma 3.1. Given a position ℓ in level j, the algorithm can decide whether it is preceded by a right-half 1-mismatch occurrence of P_i in O(log^3 m / log log m) time and, if this is the case, it can also compute the 1-mismatch sketches of T[1, ℓ − 2^i − 1] and T[1, ℓ − 1].

As explained above, the hardest case is when ℓ is close to the positions ℓ_j^a. If ℓ is preceded by a right-half 1-mismatch occurrence of P_i, then T[ℓ − 2^i, ℓ − 2^{i-1} − 1] is an occurrence of P_i[1, 2^{i-1}]. Recall that the ℓ_j^a are preceded by right-half 1-mismatch occurrences of P_i as well, and each of them starts with P_i[1, 2^{i-1}]. Since ℓ is close to ℓ_j^a, T[ℓ − 2^i, ℓ − 2^{i-1} − 1] has a large overlap with one of these occurrences. This implies that the substring between ℓ_j^a and ℓ − 2^i is a power of the minimal period of P_i[1, 2^{i-1}] (see Fig. 1), which will allow us to compute the 1-mismatch sketches of T[1, ℓ − 2^i − 1] from the 1-mismatch sketches of T[1, ℓ_j^a − 2^i − 1].

[Figure 1: The hard case of Lemma 3.1. Position ℓ is close to one of the positions ℓ_j^a; the two occurrences of P_i[1, 2^{i-1}] starting at ℓ_j^a − 2^i and ℓ − 2^i overlap.]

The algorithm uses the lemma both to compile the output and to update the levels. When the algorithm encounters a position in the top level that is preceded by a right-half 1-mismatch occurrence of P_i and followed by an exact occurrence of S_i, the algorithm outputs it together with the required difference of symbols, which is computed from the 1-mismatch sketches using Lemma 2.4.

Let us show how the algorithm updates the levels. When a new symbol T[q] arrives and T[q] = S_i[1], the algorithm adds q to level 0. If q is preceded by a right-half 1-mismatch occurrence of P_i, then the algorithm also tries to update the ℓ_0^a values and possibly retains the 1-mismatch sketches of T[1, q − 2^i − 1] and T[1, q − 1] output by the first process. The algorithm then updates each of the remaining levels in turn. To update level j, it considers the leftmost position ℓ_j in this level. If the Karp-Rabin fingerprints imply that it is an occurrence of S_i[1, 2^{j+1}], the algorithm promotes it to the next level, and otherwise discards it. In the former case the algorithm uses Lemma 3.1 to check whether ℓ_j is preceded by a right-half 1-mismatch occurrence of P_i. If it is, it computes the 1-mismatch sketches and updates the positions ℓ_{j+1}^a, a = 1, 2, 3, 4.

3.3 Space and time complexities

Recall that for each i = 0, 1, . . . , ⌈log m⌉ we run two processes in parallel. The bottleneck of the first process is the 1-Mismatch algorithm; it uses O(log^4 m / log log m) space and O(log^3 m / log log m) time per symbol. The time complexity of the second process is bounded by O(log m) applications of the algorithm of Lemma 3.1 (one per level) and is O(log^4 m / log log m) per symbol. The space complexity of the second process is O(log^3 m / log log m). Since the ⌈log m⌉ + 1 values of i contribute another log m factor, our 1-Mismatch with Data Recovery algorithm uses O(log^5 m / log log m) space and O(log^5 m / log log m) time per symbol.

4 Proof of Theorem 1.3

In this section we prove Theorem 1.3, which gives a streaming algorithm for the Weighted Pattern Matching problem under the assumption that the pattern is a regular string and the text is weighted. Recall that H(T) is the string obtained from T by choosing at each position the symbol with the maximum probability. If P matches T at some alignment q with probability at least 1/z, then H(T)[q − m + 1, q] is a log z-mismatch occurrence of P. This is because at each mismatch position r ∈ {1, . . . , m} we have Pr[T[q − m + r] = P[r]] ≤ 1/2, so more than log z mismatches would push the match probability below 1/z. We use our k-Mismatch with Data Recovery algorithm with k = log z for P and H(T) to find all such alignments. We now show how to check, for each such alignment, whether it indeed corresponds to an occurrence of the pattern in the weighted text.

Definition 4.1. A maximal matching suffix of T[1, q] is a (regular) string S such that S matches T[q − |S| + 1, q] with probability ≥ 1/(2z) and either |S| = q or any string aS, for a ∈ Σ, matches T[q − |S|, q] with probability < 1/(2z).

A crucial property of maximal matching suffixes is that their number does not exceed 2z [1]. Note that P matches T at an alignment q with probability ≥ 1/z if and only if there is a maximal matching suffix S of T[1, q] that ends with P. Moreover, the match probability between P and T is equal to the match probability between the m-length suffix of S and T.

We maintain one number stream {x_i} for each maximal matching suffix S. The stream has length at least |S|, and the last |S| numbers are the probabilities of the symbols of S and T to match. On top of this stream we run the Sliding Window Product algorithm of Lemma 2.5. We call a position i of the stream a mismatch position if x_i ≤ 1/2. Let r be the rightmost mismatch position such that Π_{i≥r+1} x_i ≤ 1/(2z); if no such mismatch position exists, we put r = 0. Note that there are at most log z + 1 mismatch positions to the right of r. Each such position corresponds to a choice of some symbol of the alphabet, and we can compute the difference between this symbol and the corresponding symbol of T. We index the stream by the mismatch positions and the corresponding symbol differences. We also store the total product of the numbers located to the right of each of these at most log z + 1 mismatch positions.

We first explain how we compute the answer. When we identify a log z-mismatch occurrence of P in H(T), we find a stream that corresponds to a maximal matching suffix that ends with P, if any (using the indexing), and output a (1 − ε)-approximation of the product of the last m numbers in the stream as the answer.

Secondly, we explain how we maintain the streams. Let Σ_{q+1} be the set of symbols b ∈ Σ such that Pr[T[q + 1] = b] ≥ 1/(2z). At the arrival of position q + 1 we create |Σ_{q+1}| copies of each stream. We then add Pr[T[q + 1] = b] for each b ∈ Σ_{q+1} to the b-th copy of each stream and update the mismatch positions and all related information in O(log z) time in a straightforward manner. At this moment we might have more than one stream for a single maximal matching suffix. On the other hand, the streams that correspond to a fixed maximal matching suffix have the same indices. We can therefore delete the duplicates in the following way. The first step is to build a trie on the indices of the streams in O(z log z) time and space to sort the streams (see [12] for a definition). The second step is to select one stream for each index (any of them) and delete the remaining ones.

The algorithm can output an incorrect answer only when the k-mismatch algorithm errs, which happens with probability 1/poly(m). We now analyse the algorithm. For k = log z the k-mismatch algorithm takes O(log^2 z · log^{10} m / log log m) space and O(log z · log^8 m / log log m) time per arriving symbol. Since the number of maximal matching suffixes never exceeds 2z, the total number of streams is O(z). Updating the streams, including the duplicate removal, takes O(z log z) time per arrival. For each of the streams we run the algorithm of Lemma 2.5, which takes O(log z / log(1/(1−ε))) space and O(1) time per symbol. Finally, to find the "right" stream we need O(z log z) time per arrival. In total, this is O(z log z / log(1/(1−ε)) + log^2 z · log^{10} m / log log m) space and O(z log z + log z · log^8 m / log log m) time per symbol.
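To make the first observation of this section concrete, here is a small self-contained Python illustration (toy data of our own, not part of the paper's algorithm) of H(T) and of the log z bound on the number of mismatches:

```python
import math

def heavy_string(weighted):
    """H(W): at each position pick a symbol of maximum probability."""
    return "".join(max(dist, key=dist.get) for dist in weighted)

def match_probability(pattern, weighted):
    """Probability that a regular string matches a weighted window:
    the product of the per-position probabilities."""
    p = 1.0
    for c, dist in zip(pattern, weighted):
        p *= dist.get(c, 0.0)
    return p

window = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}, {"b": 1.0}]
pattern, z = "bab", 16
if match_probability(pattern, window) >= 1 / z:
    mism = sum(x != y for x, y in zip(pattern, heavy_string(window)))
    assert mism <= math.log2(z)   # each mismatch costs a factor <= 1/2
```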

5 Proof of Theorem 1.4

We conclude by showing a streaming algorithm for the Weighted Pattern Matching problem where both the pattern and the text are weighted. We start by generating the at most z (regular) strings that match the pattern, in a naive fashion, by scanning the pattern from left to right (see the sketch below). If P matches T at some alignment q, then H(T)[q − m + 1, q] must be a log z-mismatch occurrence of one of the generated strings. We use our k-Mismatch with Data Recovery algorithm with k = log z for each of the generated strings and H(T) to find all such occurrences. At each alignment q we consider all generated strings that match the text and index each of them with the mismatches and symbol differences. For each alignment, we build a trie of the indices.

We also maintain a set of streams of maximal matching suffixes in the text, as we did in Section 4. For each of the streams we run the algorithm of Lemma 2.5. At each alignment q we consider each of the matching suffixes in turn. Recall that each regular string that matches T[q − m + 1, q] is a suffix of one of the maximal matching suffixes of T. Using the algorithm of Lemma 2.5 we find an approximation of the match probability for each such string, and if the Sliding Window Product algorithm does not output "No", we search for the string's index (composed of the mismatch positions and symbol differences) in the trie we built at the previous step. If the search succeeds, P and T[q − m + 1, q] match.

We now analyse the algorithm. The algorithm can output an incorrect answer only when one of the k-mismatch algorithms errs, which happens with probability z/poly(m) = 1/poly(m). For k = log z the k-mismatch algorithms take O(z log^2 z · log^{10} m / log log m) space and O(z log z · log^8 m / log log m) time per symbol. Building the trie of mismatch encodings takes O(z log z) time. Updating the streams takes O(z log z) time as there are O(z) of them. For each of the streams we run the algorithm of Lemma 2.5, which takes O(log z / log(1/(1−ε))) space and O(1) time. Searching for the indices in the trie takes O(z log z) time. In total, this is O(z log z / log(1/(1−ε)) + z log^2 z · log^{10} m / log log m) space and O(z log z · log^8 m / log log m) time per symbol.
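The generation step used here and in Fact 1.1 admits a direct Python sketch (our own illustration; a weighted pattern is given as a list of symbol-to-probability dictionaries):

```python
def generate_matching_strings(weighted_pattern, z):
    """Enumerate all regular strings matching the weighted pattern with
    probability >= 1/z, scanning left to right and pruning every partial
    string as soon as it drops below the threshold. Since the match
    probabilities of distinct strings sum to at most 1, at most z strings
    are ever kept."""
    partial = [("", 1.0)]
    for dist in weighted_pattern:
        partial = [(s + b, p * pb)
                   for s, p in partial
                   for b, pb in dist.items()
                   if p * pb >= 1.0 / z]
    return partial

pat = [{"a": 0.9, "c": 0.1}, {"g": 0.5, "t": 0.5}]
print(generate_matching_strings(pat, z=4))
# -> [('ag', 0.45), ('at', 0.45)] up to floating-point rounding;
#    'cg' and 'ct' are pruned at probability 0.05 < 1/4
```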

References

[1] A. Amir, E. Chencinski, C. S. Iliopoulos, T. Kopelowitz, and H. Zhang. Property matching and weighted matching. Theor. Comput. Sci., 395(2-3):298–310, 2008.

[2] A. Amir, C. Iliopoulos, O. Kapah, and E. Porat. Approximate matching in weighted sequences. In CPM'06: Proc. of the 17th Annual Symposium on Combinatorial Pattern Matching, volume 4009 of LNCS, pages 365–376, 2006.

[3] C. Barton, T. Kociumaka, S. P. Pissis, and J. Radoszewski. Efficient index for weighted sequences. In CPM'16: Proc. of the 27th Annual Symposium on Combinatorial Pattern Matching, LIPIcs, pages 4:1–4:13, 2016.

[4] C. Barton and S. P. Pissis. Linear-time computation of prefix table for weighted strings. In WORDS'15: Proc. of the 10th International Conference on Combinatorics on Words, volume 9304 of LNCS, pages 73–84, 2015.

[5] S. Biswas, M. Patil, S. V. Thankachan, and R. Shah. Probabilistic threshold indexing for uncertain strings. In EDBT'16: Proc. of the 19th International Conference on Extending Database Technology, pages 401–412, 2016.

[6] D. Breslauer and Z. Galil. Real-time streaming string-matching. In CPM'11: Proc. of the 22nd Annual Symposium on Combinatorial Pattern Matching, volume 6661 of LNCS, pages 162–172, 2011.

[7] M. Christodoulakis, C. S. Iliopoulos, L. Mouchard, and K. Tsichlas. Pattern matching on weighted sequences. In CompBioNets'04: Proc. of the 1st International Conference on Algorithms and Computational Methods for Biochemical and Evolutionary Networks, KCL publications, 2004.

[8] R. Clifford, M. Jalsenius, E. Porat, and B. Sach. Space lower bounds for online pattern matching. Theor. Comput. Sci., 483:58–74, 2013.

[9] R. Clifford and T. Starikovskaya. Approximate Hamming distance in a stream. In ICALP'16: Proc. of the 43rd International Colloquium on Automata, Languages, and Programming, LIPIcs, pages 20:1–20:13, 2016.

[10] R. Clifford, A. Fontaine, E. Porat, B. Sach, and T. Starikovskaya. The k-mismatch problem revisited. In SODA'16: Proc. of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2039–2052, 2016.

[11] R. Clifford, A. Fontaine, E. Porat, B. Sach, and T. A. Starikovskaya. Dictionary matching in a stream. In ESA'15: Proc. of the 23rd Annual European Symposium on Algorithms, volume 9294 of LNCS, pages 361–372, 2015.

[12] M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007.

[13] F. Ergün, H. Jowhari, and M. Saglam. Periodicity in streams. In APPROX'10: Proc. of the 13th International Workshop on Approximation, Randomization, and Combinatorial Optimization, volume 6302 of LNCS, pages 545–559, 2010.


[14] C. S. Iliopoulos, C. Makris, Y. Panagis, K. Perdikuri, E. Theodoridis, and A. K. Tsakalidis. The weighted suffix tree: An efficient data structure for handling molecular weighted sequences and its applications. Fundam. Inform., 71(2-3):259–277, 2006.

[15] R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. of Research and Development, 31(2):249–260, 1987.

[16] T. Kociumaka, S. P. Pissis, and J. Radoszewski. Pattern matching and consensus problems on weighted sequences and profiles. CoRR, abs/1604.07581v2, 2016.

[17] C. Pizzi and E. Ukkonen. Fast profile matching algorithms - A survey. Theor. Comput. Sci., 395(2-3):137–157, 2008.

[18] B. Porat and E. Porat. Exact and approximate pattern matching in the streaming model. In FOCS'09: Proc. of the 50th Annual Symposium on Foundations of Computer Science, pages 315–323, 2009.

[19] A. Sandelin, W. Alkema, P. Engström, W. W. Wasserman, and B. Lenhard. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucl. Acids Res., 32(1):D91–D94, 2004.

[20] X. Xia. Position weight matrix, Gibbs sampler, and the associated significance tests in motif characterization and prediction. Scientifica, 2012. Article ID 917540.

A Karp-Rabin fingerprints and 1-mismatch sketches

In this section we list several properties of Karp-Rabin fingerprints and 1-mismatch sketches that allow us to compute them efficiently. The following lemma is a straightforward corollary of the definition of Karp-Rabin fingerprints.

Lemma A.1. Let X, Y be two strings, Z = XY be their concatenation, and let φ(X), φ(Y), and φ(Z) be the Karp-Rabin fingerprints of X, Y, and Z, respectively. Then φ(Z) = (φ(X) + r^{|X|} · φ(Y)) mod p.

Therefore, if we assume that together with a Karp-Rabin fingerprint of a string of length ℓ we store r^ℓ mod p and r^{-ℓ} mod p, then knowing the Karp-Rabin fingerprints of two of the strings X, Y, Z, one can compute the Karp-Rabin fingerprint of the third string in O(1) time.

Lemma A.2. Let X, Y be two strings and Z = XY be their concatenation. Consider the 1-mismatch sketches of X, Y, and Z defined for a prime q. Then:

(i) If X = Y, then their 1-mismatch sketches are equal;

(ii) Given the 1-mismatch sketches of X and Y, we can compute the 1-mismatch sketch of Z in O(q) time;

(iii) Given the 1-mismatch sketches of X and Z, we can compute the 1-mismatch sketch of Y in O(q) time;

(iv) Given the 1-mismatch sketch of X, we can compute the 1-mismatch sketch of X^α in O(q log α) time, where α is an integer and X^α is the concatenation of α copies of X.

Proof. Property (i) is a direct corollary of the definitions of Karp-Rabin fingerprints and 1-mismatch sketches. As for Property (ii), note that we only need to compute the Karp-Rabin fingerprints of at most q concatenations of pairs of strings. Similarly, for Property (iii) we only need to compute the Karp-Rabin fingerprints of the at most q strings constructed from Y, given the Karp-Rabin fingerprints of the at most q strings constructed from X and of their concatenations (in Z). Finally, Property (iv) follows from Property (ii), as we can compute the 1-mismatch sketch of the square of any string from its 1-mismatch sketch in O(q) time, and hence obtain X^α by repeated squaring.
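As an illustration of Property (ii), here is a Python sketch (our own, building on one_mismatch_sketch and extend from Section 2.1; residue classes are 0-based slices, so class j holds positions j, j+q, j+2q, . . . counting from 0):

```python
def concat_sketch(sk_x, len_x, sk_y, q):
    """1-mismatch sketch of Z = XY from the sketches of X and Y
    (Lemma A.2(ii)). Class j of Z is X's class j followed by the
    class of Y that continues it."""
    sk_z = []
    for j in range(q):
        nx = (len_x - 1 - j) // q + 1 if j < len_x else 0  # symbols of X in class j
        jy = (j - len_x) % q                               # Y's continuing class
        sk_z.append(extend(sk_x[j] if nx else 0, nx, sk_y[jy]))
    return sk_z

x, y, q = "abcd", "efgh", 3
assert concat_sketch(one_mismatch_sketch(x, q), len(x),
                     one_mismatch_sketch(y, q), q) == one_mismatch_sketch(x + y, q)
```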

B Randomised reduction from k-Mismatch with Data Recovery to 1-Mismatch with Data Recovery

We start by filtering out the locations where the Hamming distance is large. To this end we use a randomised algorithm for the following 2-approximate k-mismatch problem. Let H be the true Hamming distance at a particular alignment of the pattern and the text. The algorithm outputs "Yes" if H ≤ k, either "Yes" or "No" if k < H ≤ 2k, and "No" if H > 2k.

Lemma B.1 ([10]). Given a pattern P of length m and a streaming text arriving one symbol at a time, there is a randomised O(k^2 log^6 m)-space algorithm which takes O(log^5 m) worst-case time per arriving symbol and solves the 2-approximate k-mismatch problem. The probability of error is at most 1/m^2.

Next, we consider ℓ = ⌊log m⌋ partitions of the pattern into equispaced subpatterns. More precisely, we select ℓ random primes from the interval [k log^2 m, 34 k log^2 m]. For each prime p_i the pattern P is then partitioned into p_i subpatterns P_{i,j} = P[j]P[p_i + j]P[2p_i + j] . . ., where j = 1, . . . , p_i. Consider a particular location q in the text and the set of all mismatches between P and T[q − m + 1, q]. Let us call a mismatch isolated if it is the only mismatch between some subpattern P_{i,j} and the corresponding subsequence of T[q − m + 1, q], and let M_q be the set of all isolated mismatches at an alignment q.

Lemma B.2 ([10]). If |M_q| ≤ 2k, the set M_q contains all mismatches between P and T[q − m + 1, q] with probability at least 1 − 1/m^2.

To finalise the reduction we partition the text T into p_i equispaced substreams T_{i,j}, where j = 1, . . . , p_i, for each prime p_i. We run the 1-Mismatch with Data Recovery algorithm for all Σ_i p_i^2 = O(k^2 log^5 m) subpattern/substream pairs (P_{i,j1}, T_{i,j2}). At each location where the algorithm for the 2-approximate problem outputs "Yes", we retrieve all isolated mismatches and the data using the appropriate p_i instances of the 1-Mismatch with Data Recovery problem for each i. This completes the description of our solution to the k-Mismatch with Data Recovery problem.

In total, we use O(k^2 log^{10} m / log log m) space. To analyse the time complexity, note that when a new symbol of T arrives, we need to send it to O(log m) substreams only (one for each prime) and run the next step for each of the O(k log^2 m) instances of the 1-Mismatch with Data Recovery problem on each of these substreams, which requires O(k log^8 m / log log m) time per symbol.
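A Python sketch of the partitioning machinery (our own illustration; the naive primality test is fine here since the primes have magnitude Θ(k log^2 m)):

```python
import math
import random

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def pick_primes(k, m, count):
    """Select `count` random primes from [k log^2 m, 34 k log^2 m]."""
    lo = max(2, int(k * math.log2(m) ** 2))
    candidates = [n for n in range(lo, 34 * lo + 1) if is_prime(n)]
    return random.sample(candidates, min(count, len(candidates)))

def equispaced_parts(s, p):
    """Partition a string (or stream prefix) into the p subpatterns
    P_{i,j} = P[j] P[p+j] P[2p+j] ... (0-based classes here)."""
    return [s[j::p] for j in range(p)]

# e.g. equispaced_parts("abaababa", 3) == ['aab', 'bba', 'aa']
```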

C Proof of Lemma 3.1

We consider three cases based on the relationship between ℓ and the positions ℓ_j^a.


Case 1: ℓ < ℓ_j^1, or ℓ_j^a + 2^{i-2} ≤ ℓ < ℓ_j^{a+1} for some a ∈ {1, 2, 3}. In this case ℓ cannot be preceded by a right-half 1-mismatch occurrence of P_i, by definition.

Case 2: ℓ_j^a ≤ ℓ < ℓ_j^a + 2^{i-2}. Below we explain how to compute the 1-mismatch sketches of T[ℓ − 2^i, ℓ − 1] in O(log^3 m / log log m) time. The algorithm can then use these sketches and the ones of P_i to determine in O(log^2 m / log log m) time whether ℓ is preceded by a right-half 1-mismatch occurrence of P_i.

First, the algorithm computes the 1-mismatch sketches of T[1, ℓ − 2^i − 1] by extensively using Lemma A.2. Recall that ℓ_j^a is preceded by a right-half 1-mismatch occurrence of P_i. If the same property holds for ℓ, then T[ℓ_j^a − 2^i, ℓ_j^a − 2^{i-1} − 1] and T[ℓ − 2^i, ℓ − 2^{i-1} − 1] are two occurrences of P_i[1, 2^{i-1}] that overlap by at least 2^{i-2} positions. Therefore T[ℓ_j^a − 2^i, ℓ − 2^i − 1] is a power of the minimal period of P_i[1, 2^{i-1}] by the Fine and Wilf periodicity lemma [12]. The algorithm stores the 1-mismatch sketches of T[1, ℓ_j^a − 2^i − 1], and it can compute the 1-mismatch sketches of T[ℓ_j^a − 2^i, ℓ − 2^i − 1] from the 1-mismatch sketches of the minimal period of P_i[1, 2^{i-1}] in O(log^3 m / log log m) time. Therefore, it can compute the 1-mismatch sketches of T[1, ℓ − 2^i − 1] in O(log^3 m / log log m) time. Second, the algorithm computes the 1-mismatch sketches of T[1, ℓ − 1] using the 1-mismatch sketches of T[1, ℓ_j^1 − 1] and of the minimal period of S_i[1, 2^j] in O(log^3 m / log log m) time. Finally, the algorithm computes the 1-mismatch sketches of T[ℓ − 2^i, ℓ − 1] in O(log^2 m / log log m) time using the 1-mismatch sketches of T[1, ℓ − 2^i − 1] and T[1, ℓ − 1].

Case 3: ℓ ≥ ℓ_j^1 + 2^i. In this case T[ℓ − 2^i, ℓ − 1] is a suffix of T[ℓ_j^1, ℓ − 1], which is a power of the minimal period of S_i[1, 2^j], that is, of S_i[1, ρ_j]. As we know the 1-mismatch sketches of T[1, ℓ_j^1 − 1], S_i[1, ρ_j], and S_i[1, δ_j] (recall that δ_j is the remainder of −2^i modulo ρ_j), the algorithm can use Lemma A.2 to compute the 1-mismatch sketches of T[1, ℓ − 1] (by extending T[1, ℓ_j^1 − 1] with a power of S_i[1, ρ_j]) and of T[1, ℓ − 2^i − 1] (by subtracting the sketches of S_i[1, δ_j] from the sketches of a power of S_i[1, ρ_j]) in O(log^3 m / log log m) time. It can then use them to determine whether T[ℓ − 2^i, ℓ − 1] is a right-half 1-mismatch occurrence of P_i.

D Proof of Lemma 2.5

Lemma 2.5 gives a (1 − ε)-approximation algorithm for the Sliding Window Product problem that takes O(log z / log(1/(1−ε))) space and O(1) time per element arrival. Let {x_i}_{i=1}^∞ be a number stream with x_i ∈ [0, 1]. When x_q for q ≥ m arrives, the algorithm must output either "No" or a number y. If the algorithm outputs "No", then x_{q−m+1} · x_{q−m+2} · . . . · x_q < 1/z. If the algorithm outputs a number y, then it is a (1 − ε)-approximation of x_{q−m+1} · x_{q−m+2} · . . . · x_q.

At time q the algorithm maintains a family of at most M(z) = ⌈2 log_{1−ε}((1 − ε)/z)⌉ intervals [i_1, i_2 − 1], [i_2, i_3 − 1], . . . , [i_{k(q)}, q], such that 1 ≤ i_1 < i_2 < . . . < i_{k(q)} < i_{k(q)+1} = q + 1, which obeys the following invariant:

(i) All intervals [i_j, i_{j+1} − 1] for j ≥ 2 are subintervals of [q − m + 1, q], and [i_1, i_2 − 1] has a non-empty intersection with [q − m + 1, q];

(ii) If x_{i_j} < 1 − ε, then i_{j+1} = i_j + 1;

(iii) Otherwise, x_{i_j} · x_{i_j+1} · . . . · x_{i_{j+1}−1} ≥ 1 − ε and either j = k(q) or x_{i_j} · x_{i_j+1} · . . . · x_{i_{j+1}} < 1 − ε.

Observation D.1. If k(q) = M(z), the total product of the numbers in all intervals is at most (1 − ε)/z.

Proof. This is because the product of the numbers in every two consecutive intervals is at most 1 − ε.

The algorithm stores the product of the numbers in each interval and the total product of the numbers in all intervals. When x_{q+1} arrives, it updates the family in the following way. Let π be the product of the numbers in the interval [i_{k(q)}, q]. If π · x_{q+1} ≥ 1 − ε, then the algorithm extends the interval [i_{k(q)}, q] by the element x_{q+1}. Otherwise, it creates a new interval [q + 1, q + 1]. If the number of intervals becomes larger than M(z) or if i_2 ≤ (q + 1) − m + 1, the algorithm deletes the leftmost interval [i_1, i_2 − 1]. Finally, it updates the total product of the numbers in the intervals.

Let us now explain how the algorithm exploits the intervals. By Observation D.1, if q − m + 1 < i_1, then the product of the last m numbers is at most (1 − ε)/z and the algorithm outputs "No". Otherwise it follows from the invariant that q − m + 1 ∈ [i_1, i_2 − 1]. In this case we output the total product of the numbers in the intervals. Recall that if [i_1, i_2 − 1] is not a singleton interval, then the product of the numbers in it is at least 1 − ε. Therefore, the answer is a (1 − ε)-approximation of x_{q−m+1} · x_{q−m+2} · . . . · x_q.

The space complexity follows from the fact that M(z) = ⌊2 log z / log(1/(1−ε))⌋.
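A direct Python transcription of the proof above (our own sketch: intervals are stored explicitly, plain floats stand in for the exact log-probability representation, and for simplicity a number is reported whenever the first interval still covers the window start):

```python
import math

def sliding_window_product(stream, m, z, eps):
    """Yields, for every position q >= m, a (1-eps)-approximation of
    x_{q-m+1} * ... * x_q, or None ('No') when that product is guaranteed
    to be at most (1-eps)/z."""
    M = max(1, math.floor(2 * math.log(z) / math.log(1 / (1 - eps))))  # M(z)
    intervals = []     # [start, end, product], 1-based positions, left to right
    total = 1.0        # product over all stored intervals
    for q, x in enumerate(stream, start=1):
        # extend the last interval if its product stays >= 1 - eps,
        # otherwise open a new (possibly singleton) interval [q, q]
        if intervals and intervals[-1][2] * x >= 1 - eps:
            intervals[-1][1] = q
            intervals[-1][2] *= x
        else:
            intervals.append([q, q, x])
        total *= x
        # delete the leftmost interval if the budget is exceeded (then, by
        # Observation D.1, the total is already <= (1-eps)/z) or if it lies
        # entirely outside the window [q - m + 1, q]
        while len(intervals) > M or (len(intervals) > 1
                                     and intervals[1][0] <= q - m + 1):
            total /= intervals.pop(0)[2]
        if q >= m:
            yield total if intervals[0][0] <= q - m + 1 else None

# e.g. windows of width 3 over a short stream:
print(list(sliding_window_product([0.9, 0.5, 0.9, 0.9, 0.1], m=3, z=10, eps=0.1)))
# -> [0.405, 0.405, 0.081] (up to floating-point rounding)
```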