Longest common substring with approximately k mismatches

Tomasz Kociumaka^1, Jakub Radoszewski^1, and Tatiana Starikovskaya^2

arXiv:1712.08573v1 [cs.DS] 22 Dec 2017

^1 Institute of Informatics, University of Warsaw, Warsaw, Poland, [kociumaka,jrad]@mimuw.edu.pl
^2 École Normale Supérieure, Paris, France, [email protected]

Abstract

In the longest common substring problem we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well-known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one letter. To circumvent this, Leimeister and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature. In this paper we first show a conditional lower bound based on the SETH hypothesis implying that there is little hope to improve existing solutions. We then introduce a new but closely related problem of the longest common substring with approximately k mismatches and use computational geometry techniques to show that it admits a solution with strongly subquadratic running time. We also apply these results to obtain a strongly subquadratic approximation algorithm for the longest common substring with k mismatches problem and show conditional hardness of improving its approximation ratio.

1 Introduction

Understanding how similar two strings are and what they share in common is a central task in stringology, the significance of which is witnessed for example by the 50,000+ citations of the paper introducing BLAST [3], a heuristic algorithmic tool for comparing biological sequences. This task can be formalised in many different ways, from the longest common substring problem to the edit distance problem. The longest common substring problem can be solved in optimal linear time and space, while the best known algorithms for the edit distance problem require n^{2−o(1)} time, which makes the longest common substring problem an attractive choice for many practical applications. On the other hand, the longest common substring problem is not robust and its solution can vary greatly when the input strings are changed even by one letter. To overcome this issue, a new problem, called the longest common substring with k mismatches, has recently been introduced. In this paper we continue this line of research.

1.1 Related work

Let us start with a precise statement of the longest common substring problem.

Problem 1 (Longest Common Substring). Given two strings T1, T2 of length n, find a maximal-length substring of T1 that occurs in T2.

The suffix tree of T1 and T2, a data structure containing all suffixes of T1 and T2, allows one to solve this problem in linear time and space [30, 18, 20], which is optimal as any algorithm needs Ω(n) time to read and Ω(n) space to store the strings. However, if we only account for "additional" space, that is, the space the algorithm uses apart from the space required to store the input, then the suffix tree-based solution is not optimal and has been improved in a series of publications [6, 24, 28]. The major disadvantage of the longest common substring problem is that its solution is not robust. Consider, for example, two pairs of strings: a^{2m+1}, a^{2m}b and a^m b a^m, a^{2m}b. The longest common substring

of the first pair of strings is twice as long as the longest common substring of the second pair of strings, although we changed only one letter. This makes the longest common substring unsuitable to be used as a measure of similarity of two strings: intuitively, changing one letter should not change the measure of similarity much. To overcome this issue, it is natural to allow the substring to occur in T1 and T2 not exactly but with a small number of mismatches.

Problem 2 (Longest Common Substring with k Mismatches). Given two strings T1, T2 of length n and an integer k, find a maximal-length substring of T1 that occurs in T2 with at most k mismatches.

The problem can be solved in quadratic time and space by a dynamic programming algorithm, but more efficient solutions have also been shown. The longest common substring with one mismatch problem was first considered in [7], where an O(n^2)-time and O(n)-space solution was given. This result was further improved by Flouri et al. [15], who showed an O(n log n)-time and O(n)-space solution. For a general value of k, the problem was first considered by Leimeister and Morgenstern [27], who suggested a greedy heuristic algorithm. Flouri et al. [15] showed that Longest Common Substring with k Mismatches admits a quadratic-time and constant (additional) space algorithm. Grabowski [17] presented two output-dependent algorithms with running times O(n((k + 1)(ℓ0 + 1))^k) and O(n^2 k/ℓk), where ℓ0 is the length of the longest common substring of T1 and T2 and ℓk is the length of the longest common substring with k mismatches of T1 and T2. Aluru et al. [4] gave an O(2^k n)-space, O(n(2 log n)^{k+1})-time algorithm. Their algorithm can output the longest common substring with k′ mismatches for all k′ ≤ k using the same amount of time and space. Finally, Abboud et al. [1] applied the polynomial method to develop a k^{1.5} n^2 / 2^{Ω(√(log n / k))}-time randomised solution to the problem.

1.2 Our contribution

Our contribution is as follows. In Section 2 we show that the existence of a strongly subquadratic-time algorithm for Longest Common Substring with k Mismatches on strings over a binary alphabet for k = Ω(log n) refutes the Strong Exponential Time Hypothesis (SETH) of Impagliazzo, Paturi and Zane [8, 11]:

Hypothesis (SETH). For every δ > 0 there exists an integer m such that SAT on m-CNF formulas on n variables cannot be solved in m^{O(1)} · 2^{(1−δ)n} time.

This conditional lower bound implies that there is little hope to improve existing solutions to Longest Common Substring with k Mismatches. To this end, we introduce a new problem, inspired by the work of Andoni and Indyk [5].

Problem 3 (Longest Common Substring with Approximately k Mismatches). Two strings T1, T2 of length n, an integer k, and a constant ε > 0 are given. If ℓk is the length of the longest common substring with k mismatches of T1 and T2, return a substring of T1 of length at least ℓk that occurs in T2 with at most (1 + ε) · k mismatches.

Let dH(S1, S2) denote the Hamming distance between equal-length strings S1 and S2, that is, the number of mismatches between them. Then we are to find substrings S1 and S2 of T1 and T2, respectively, of length ℓk such that dH(S1, S2) ≤ (1 + ε) · k. Although the problem statement is not standard, it makes perfect sense from the practical point of view. It is also more robust than the Longest Common Substring with k Mismatches problem, as for most applications it is not important whether a returned substring occurs in T1 and T2 with, for example, 10 or 12 mismatches. The result is also important from the theoretical point of view as it improves our understanding of the big picture of string comparison. In their work, Andoni and Indyk used the technique of locality-sensitive hashing to develop a space-efficient randomised index for a variant of the approximate pattern matching problem. We extend their work with new ideas in the construction and the analysis to develop a randomised subquadratic-time solution to Problem 3. We state our results for strings over a constant-sized alphabet. This result is presented in Section 3.


In Section 4 we consider approximation algorithms for the length of the Longest Common Substring with k Mismatches. By applying the previous techniques, we show a strongly subquadratic 2-approximation algorithm and show that no strongly subquadratic (2 − ε)-approximation algorithm exists for any ε > 0 unless SETH fails. Finally, in Section 5 we show a strongly subcubic solution for Longest Common Substring with k Mismatches for all k by reducing it (for arbitrary alphabet size) to Binary Jumbled Indexing. Namely, we show that Longest Common Substring with k Mismatches for all k = 1, . . . , n can be solved in O(n^{2.859}) time, improving upon naive computation performed for every k separately. This is a full and extended version of the conference paper [29].

2 Longest Common Substring with k Mismatches is SETH-hard

Recall that the Hamming distance of two strings U and V of the same length, denoted as dH(U, V), is simply the number of mismatches. Our proof is based on conditional hardness of the following problem.

Problem 4 (Orthogonal Vectors). Given a set A of N vectors from {0, 1}^d each, does there exist a pair of vectors U, V ∈ A that is orthogonal, i.e., Σ_{h=1}^{d} U[h] · V[h] = 0?

The following result was shown by Williams [31].

Fact 1 ([31]). Suppose there is ε > 0 such that for all constant c, Orthogonal Vectors can be solved in 2^{o(d)} · N^{2−ε} time on a set of N vectors of dimension d = c log N. Then SETH is false.

We treat vectors from {0, 1}^d as binary sequences of length d. Let us introduce two morphisms, µ and τ:

µ(0) = 011 1000,   µ(1) = 000 1000,   τ(0) = 001 1000,   τ(1) = 111 1000.

We will use the following two observations.

Observation 1. dH(µ(0), τ(0)) = dH(µ(0), τ(1)) = dH(µ(1), τ(0)) = 1 and dH(µ(1), τ(1)) = 3.

Observation 2. Let x, y, z ∈ {0, 1}. Then the string 1000 has exactly two occurrences in 1000xyz1000.

Let us also introduce a string gadget H = γ^d, where γ = 100 1000. Note that γ ≠ µ(x) and γ ≠ τ(x) for x ∈ {0, 1}. Further note that |H| = |µ(U)| = |τ(U)| = 7d for any U ∈ A.

Lemma 1. Consider a set of vectors A = {U1, . . . , UN} from {0, 1}^d, the strings

T1 = H^q µ(U1) H^q . . . µ(UN) H^q,    T2 = H^q τ(U1) H^q . . . τ(UN) H^q

for some positive integer q, and k = d. Then:

(a) If the set A contains two orthogonal vectors, then the Longest Common Substring with k Mismatches problem for T1 and T2 has a solution of length at least ℓ1 = (14q + 7)d.

(b) If the set A does not contain two orthogonal vectors, then all the solutions for the Longest Common Substring with k Mismatches problem for T1 and T2 have length smaller than ℓ2 = (7q + 14)d.

Proof. (a) Assume that Ui and Uj are a pair of orthogonal vectors. T1 contains a substring H^q µ(Ui) H^q and T2 contains a substring H^q τ(Uj) H^q. Both substrings have length ℓ1 and, by Observation 1, their Hamming distance is exactly k = d.

(b) Assume to the contrary that there are indices a and b for which the substrings S1 = T1[a, a + ℓ2 − 1] and S2 = T2[b, b + ℓ2 − 1] have at most k mismatches. First, let us note that 7 | a − b. Indeed, otherwise S1 would contain at least ⌊(ℓ2 − 6)/7⌋ = (q + 2)k − 1 substrings of the form 1000 which, by Observation 2, would not be aligned with substrings 1000 in S2. Hence, they would account for more than k mismatches between S1 and S2.

Let us call all the substrings of T1 and T2 that come from the 3-letter prefixes of µ(0), µ(1), τ(0), τ(1), and γ the core substrings, with core substrings that come from γ being gadget core substrings. We have already established that the core substrings of S1 and S2 are aligned. Moreover, S1 and S2 contain at least ⌊(ℓ2 − 6)/7⌋ = (q + 2)k − 1 core substrings each. Amongst every (q + 2)k − 1 consecutive core substrings in S1, some k consecutive ones must come from µ(Ui) for some index i; a symmetric property holds for S2 and τ(Uj). Moreover, as only the gadget core substrings in S1 and S2 can match exactly, at most k core substrings that are contained in S1 and S2 can be non-gadget. Hence, S1 and S2 contain exactly k non-gadget core substrings each. If they were not aligned, they would have produced more than k mismatches in total with the gadget core substrings. It follows that S1 and S2 must contain, as aligned substrings, µ(Ui)[1, 7d − 4] and τ(Uj)[1, 7d − 4] for some i, j ∈ {1, . . . , N}, respectively. Hence, dH(µ(Ui)[1, 7d − 4], τ(Uj)[1, 7d − 4]) ≤ k which, by Observation 1, implies that Ui and Uj are orthogonal.

Theorem 1. Suppose there is ε > 0 such that Longest Common Substring with k Mismatches can be solved in O(n^{2−ε}) time on strings over a binary alphabet for k = Ω(log n). Then SETH is false.

Proof. The reduction of Lemma 1 with q = 1 constructs, for an instance of the Orthogonal Vectors problem with N vectors of dimension d, an equivalent instance of the Longest Common Substring with k Mismatches problem with strings of length n = 7d(2N + 1) and k = d. Thus, assuming that Longest Common Substring with k Mismatches can be solved in O(n^{2−ε}) time for k = Ω(log n), the constructed instance can be solved in O(N^{2−ε} d^{O(1)}) time if d = c log N. This, by Fact 1, contradicts SETH.
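To make the construction of Lemma 1 concrete, the following sketch builds T1 and T2 from a set of vectors. It is only an illustration of the reduction under the encodings defined above; the function and variable names are our own.

```python
# Illustration of the reduction of Lemma 1: encode a set of vectors from {0,1}^d
# into strings T1, T2 so that a common substring of length (14q+7)d with at most
# k = d mismatches exists iff the set contains an orthogonal pair.

MU = {0: "0111000", 1: "0001000"}    # morphism mu
TAU = {0: "0011000", 1: "1111000"}   # morphism tau
GAMMA = "1001000"                    # block gamma; the gadget is H = gamma^d

def encode(vector, morphism):
    return "".join(morphism[bit] for bit in vector)

def build_instance(vectors, q):
    d = len(vectors[0])
    H = GAMMA * d                    # |H| = 7d
    Hq = H * q
    t1 = Hq + Hq.join(encode(U, MU) for U in vectors) + Hq
    t2 = Hq + Hq.join(encode(U, TAU) for U in vectors) + Hq
    return t1, t2, d                 # mismatch threshold k = d

# Example with an orthogonal pair ((1,0,1) and (0,1,0)):
T1, T2, k = build_instance([(1, 0, 1), (0, 1, 0), (1, 1, 0)], q=1)
```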

3 Longest Common Substring with Approximately k Mismatches

In this section we prove the following theorem.

Theorem 2. Assume a binary alphabet and let 0 < ε < 2 be an arbitrary constant. Longest Common Substring with Approximately k Mismatches can be solved in O(n^{1+1/(1+ε)}) space and O(n^{1+1/(1+ε)} log^2 n) time correctly with constant probability.

If the alphabet is of constant size σ > 2, we can use a standard trick and encode T1 and T2 by replacing each letter a in them with a binary vector 0^{a−1} 1 0^{σ−a}. We will only consider substrings of T1 and T2 that start at positions giving remainder 1 modulo σ. The Hamming distance between any two such substrings of T1 and T2 in the encoded form is exactly twice the Hamming distance between the corresponding original substrings, which allows us to extend our solution naturally to this case as well at the cost of an additional constant factor.
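A small illustration of this encoding follows (a sketch; the concrete letter-to-index mapping is our own choice). Substrings that start at block boundaries have exactly doubled Hamming distance in the encoded strings:

```python
# Encode a string over an alphabet of size sigma by replacing its a-th letter
# with the indicator block 0^(a-1) 1 0^(sigma-a); block-aligned substrings then
# have exactly twice the Hamming distance of the originals.
def binary_encode(text, alphabet):
    sigma = len(alphabet)
    index = {c: i for i, c in enumerate(alphabet)}        # letter -> a-1
    return "".join("0" * index[c] + "1" + "0" * (sigma - 1 - index[c]) for c in text)

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

s, t = "acb", "abb"
assert hamming(binary_encode(s, "abc"), binary_encode(t, "abc")) == 2 * hamming(s, t)
```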

3.1 Overview of the proof

The classic solution to the longest common substring problem is based on two observations. The first observation is that the longest common substring of T1 and T2 is in fact the longest common prefix of some suffix of T1 and some suffix of T2. The second observation is that the maximal length of the longest common prefix of a fixed suffix S of T1 and suffixes of T2 is reached on the two suffixes of T2 that are closest to S in the lexicographic order. This suggests the following algorithm: first, build a suffix tree of T1 and T2, which contains all suffixes of T1 and T2 and orders them lexicographically. Second, compute the longest common prefix of each suffix S of T1 and the two suffixes of T2 closest to S in the lexicographic order, one from the left and one from the right. The problem of computing the longest common prefix has been extensively studied in the literature and a number of very efficient deterministic and randomised solutions exist [9, 10, 13, 21, 19]; for example, one can use a Lowest Common Ancestor (LCA) data structure, which can be constructed in linear time and space and answers longest common prefix queries in O(1) time [13, 19].

Our solution to the longest common substring with approximately k mismatches problem is somewhat similar. We will consider Θ(n^{1/(1+ε)} log n) orderings on the suffixes of T1 and T2 and will show that with high probability the length of the longest common substring with approximately k mismatches is the answer

to a longest common prefix with approximately k mismatches (LCPk̃) query for some pair of suffixes that are close to each other in one of the orderings. In an LCPk̃ query we are given two suffixes S1, S2 of T1 and T2, respectively, and must output any integer in the interval [ℓk, ℓ_{(1+ε)·k}], where ℓk and ℓ_{(1+ε)·k} are the lengths of the longest common prefixes of S1 and S2 with at most k and at most (1 + ε) · k mismatches, respectively. Note that LCPk̃ queries can be answered deterministically in O(k) time using the kangaroo method [26, 16], but for the purposes of this paper we give a faster randomised solution.

Lemma 2. After O(n log^3 n) time and O(n log^2 n) space preprocessing of T1 and T2, an LCPk̃ query can be answered in O(log^2 n) time. The answer is correct with probability at least 1 − 1/n^2.

The key idea is to compute sketches for all power-of-two length substrings of T1 and T2. The sketches will have logarithmic length (i.e., we will be able to compare them very fast) and the Hamming distance between them will be roughly equal to the Hamming distance between the original substrings. Once the sketches are computed, we can use binary search to answer LCPk̃ queries in polylogarithmic time.

To define the orderings on suffixes of T1 and T2 we will use the locality-sensitive hashing technique, which was initially introduced for the needs of computational geometry [22]. In more detail, we will choose Θ(n^{1/(1+ε)} log n) hash functions, where each function can be considered as a projection of a string of length n onto a random subset of its positions. By choosing the size of the subset appropriately, we will be able to guarantee that the hash function is locality-sensitive: for any two strings at Hamming distance at most k, the values of the hash functions on them will be equal with high probability, while the values of the hash functions on any pair of strings at Hamming distance bigger than (1 + ε) · k will be different with high probability. For each hash function we will sort the suffixes of T1 and T2 by the lexicographic order on their hash values. As a corollary of the locality-sensitive property, if two suffixes of T1 and T2 have a long common prefix with at most k mismatches, with high probability they will be close to each other in the ordering.
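For intuition, the deterministic O(k)-time alternative mentioned above (the kangaroo method) is easy to state. The sketch below uses a naive letter-by-letter longest common extension for clarity; the O(k) bound assumes O(1)-time extension queries from an LCA structure, which we do not implement here.

```python
def lce_naive(a, b, i, j):
    """Longest common extension of a[i:] and b[j:] (naive scan; a real
    implementation answers this in O(1) via an LCA data structure)."""
    length = 0
    while i + length < len(a) and j + length < len(b) and a[i + length] == b[j + length]:
        length += 1
    return length

def lcp_k(a, b, i, j, k):
    """Length of the longest common prefix of a[i:] and b[j:] with at most k
    mismatches: the kangaroo method uses at most k + 1 extension queries."""
    pos, mismatches = 0, 0
    while True:
        pos += lce_naive(a, b, i + pos, j + pos)
        if i + pos >= len(a) or j + pos >= len(b) or mismatches == k:
            return pos
        mismatches += 1          # jump over the mismatch at offset pos
        pos += 1

assert lcp_k("abcdef", "abxdyf", 0, 0, 1) == 4   # one mismatch allowed
```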

3.2 Proof of Lemma 2

During the preprocessing stage, we compute sketches [25] of all substrings of the strings T1 and T2 of lengths ℓ = 1, 2, 4, . . . , 2^{⌊log n⌋}, which can be defined in the following way. For a fixed ℓ choose λ = ⌈3 ln n / γ^2⌉ binary vectors r_ℓ^i of length ℓ, where γ is a constant to be defined later, such that

r_ℓ^i[j] = 1 with probability 1/(4k) and r_ℓ^i[j] = 0 with probability 1 − 1/(4k)

for all i = 1, 2, . . . , λ and j = 1, 2, . . . , ℓ. For a string x of length ℓ ∈ {1, 2, 4, . . . , 2^{⌊log n⌋}} we define a sketch sk(x) to be a vector of length λ, where sk(x)[i] = r_ℓ^i · x (mod 2). In other words, to obtain sk(x) we sample each position of x with probability 1/(4k) and then sum the letters in the sampled positions modulo 2. All sketches can be computed in O(n log^3 n) time by independently running the Fast Fourier Transform (FFT) algorithm for each of the vectors r_ℓ^i and T1 and T2 [14], and occupy O(n log^2 n) space. Each substring S can be decomposed uniquely as x1 x2 . . . xr, where r = O(log n) and |x1| > |x2| > . . . > |xr| are powers of two. We define a sketch sk(S) = Σ_q sk(x_q) (mod 2). Let δ1 = (1/2)(1 − (1 − 1/(2k))^k), δ2 = (1/2)(1 − (1 − 1/(2k))^{(1+ε)·k}), and ∆ = ((δ1 + δ2)/2) · λ.

Lemma 3 ([25]). For any i, if dH(S1, S2) ≤ k, then sk(S1)[i] ≠ sk(S2)[i] with probability at most δ1. If dH(S1, S2) ≥ (1 + ε) · k, then sk(S1)[i] ≠ sk(S2)[i] with probability at least δ2.

Proof. We use a different definition of r_ℓ^i that gives the same distribution: first, we sample positions with probability 1/(2k), and then for each sampled position we decide if it equals one independently and uniformly at random. Let m = dH(S1, S2) and p1, . . . , pm be the positions of the mismatches between the two strings. If none of the positions p1, p2, . . . , pm are sampled, then sk(S1)[i] = sk(S2)[i], and otherwise for each way of sampling p1, p2, . . . , pm−1 exactly one of the two choices for pm will give sk(S1)[i] = sk(S2)[i]. (Recall that the alphabet is binary.) Hence, the probability that sk(S1)[i] ≠ sk(S2)[i] is equal to (1/2)(1 − (1 − 1/(2k))^m), which is at most δ1 if dH(S1, S2) ≤ k, and at least δ2 if the Hamming distance is at least (1 + ε) · k.
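A direct, per-substring rendering of these sketches may help fix the definitions. This is only an illustration: the paper computes all sketches in batch with FFT, the constant γ (which depends on ε) is set to an arbitrary value here, and binary input strings are assumed.

```python
import math
import random

def make_masks(length, k, n, gamma=0.1, seed=0):
    """lambda = ceil(3 ln n / gamma^2) random 0/1 vectors r^i of the given length;
    each position is set to 1 independently with probability 1/(4k)."""
    rng = random.Random(seed)
    lam = math.ceil(3 * math.log(n) / gamma ** 2)
    return [[1 if rng.random() < 1 / (4 * k) else 0 for _ in range(length)]
            for _ in range(lam)]

def sketch(x, masks):
    """sk(x)[i] = r^i . x (mod 2): the parity of the letters of the binary string x
    at the positions sampled by r^i."""
    return [sum(int(c) for c, bit in zip(x, r) if bit) % 2 for r in masks]

def sketch_distance(sk1, sk2):
    return sum(a != b for a, b in zip(sk1, sk2))
```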

Lemma 4. If dH(sk(S1), sk(S2)) > ∆, then dH(S1, S2) > k. Moreover, if dH(sk(S1), sk(S2)) ≤ ∆, then dH(S1, S2) < (1 + ε) · k. Both claims are correct with probability at least 1 − 1/n^3.

Proof. Let χ_i be an indicator random variable that is equal to one if and only if sk(S1)[i] ≠ sk(S2)[i]. The claim follows immediately from Lemma 3 and the following Chernoff bounds (see [2, Appendix A]). Indeed, for λ independently and identically distributed binary variables χ_1, χ_2, . . . , χ_λ, we have

Pr[ Σ_{i=1}^{λ} χ_i > (p + γ)λ ] ≤ e^{−2λγ^2},    Pr[ Σ_{i=1}^{λ} χ_i ≤ (p − γ)λ ] ≤ e^{−λγ^2},

where p = Pr[χ_i = 1]. We put γ = (δ2 − δ1)/2 and obtain that the error probability is at most e^{−2λγ^2} ≤ 1/n^3. (From the fact that (1 − 1/(2k))^k is increasing and that lim inf_{k→∞} (1 − 1/(2k))^k = e^{−1/2}, γ = Θ(1 − e^{−ε/2}) as a function of k; in other words, if ε is a constant, γ is a constant as well.)

If dH(S1, S2) ≤ k, Lemma 3 asserts that p ≤ δ1. By the first of the above inequalities, we have that dH(sk(S1), sk(S2)) ≤ ∆ with probability at least 1 − 1/n^3. Hence, if dH(sk(S1), sk(S2)) > ∆, then dH(S1, S2) > k with the same probability. If dH(S1, S2) ≥ (1 + ε) · k, Lemma 3 asserts that p ≥ δ2. By the second inequality, we have that dH(sk(S1), sk(S2)) > ∆ with probability at least 1 − 1/n^3. Hence, if dH(sk(S1), sk(S2)) ≤ ∆, then dH(S1, S2) < (1 + ε) · k with the same probability.

Suppose we wish to answer an LCPk̃ query on two suffixes S1, S2. It suffices to find the longest prefixes of S1, S2 such that the Hamming distance between their sketches is at most ∆. As mentioned above, these prefixes can be represented uniquely as a concatenation of strings of power-of-two lengths ℓ1 > ℓ2 > . . . > ℓr. To compute ℓ1, we initialise it with the biggest power of two not exceeding n and compute the Hamming distance between the sketches of the corresponding substrings. If it is at most ∆, we have found ℓ1; otherwise we divide ℓ1 by two and continue. Suppose that we already know ℓ1, ℓ2, . . . , ℓi and let h_i be the Hamming distance between the sketches of the prefixes of S1 and S2 of length ℓ1 + ℓ2 + . . . + ℓi. To compute ℓ_{i+1}, we initialise it with ℓi and then divide it by two until the Hamming distance between the sketches of the corresponding substrings of length ℓ_{i+1} is at most ∆ − h_i. From the above it follows that the algorithm is correct with probability at least 1 − 1/n^2 (we estimate the error probability by the union bound) and that the query time is O(log^2 n). This completes the proof of Lemma 2.
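The greedy block decomposition above can be phrased as follows. In this sketch, dist is assumed to return the Hamming distance between the sketches of two power-of-two-length substrings; below it is instantiated with the exact distance purely so that the example runs.

```python
def lcp_approx(dist, n1, n2, i, j, delta):
    """Greedily extend the common prefix of T1[i:] and T2[j:] by power-of-two
    blocks of strictly decreasing length while the accumulated (sketch) distance
    stays within delta, roughly as in the proof of Lemma 2."""
    total, budget = 0, delta
    block = 1 << (max(n1, 1).bit_length() - 1)   # largest power of two <= n1
    while block >= 1:
        if i + total + block <= n1 and j + total + block <= n2:
            d = dist(i + total, j + total, block)
            if d <= budget:
                total += block
                budget -= d
        block //= 2
    return total

T1, T2 = "abracadabra", "abracadabXa"
def exact_dist(p1, p2, length):                  # stand-in for the sketch distance
    return sum(T1[p1 + q] != T2[p2 + q] for q in range(length))

assert lcp_approx(exact_dist, len(T1), len(T2), 0, 0, delta=0) == 9
```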

3.3 Proof of Theorem 2

We start by preprocessing T1 and T2 as described in Lemma 2. The main phase of the algorithm is defined by three parameters t, w, and m to be specified later, and consists of Θ(t! log n) independent steps. At each step we choose $\binom{w}{t}$ hash functions, where each hash function can be considered as a t-tuple of projections of strings of length n onto subsets of their positions of size m. Let H be the set of all projections of strings of length n onto a single position, i.e. the value of the i-th projection on a string of length n is simply its i-th letter. We start by choosing a set of w functions u_r ∈ H^m, r = 1, 2, . . . , w, uniformly at random. Each hash function h is defined to be a t-tuple of distinct functions u_r. More formally, h = (u_{r1}, u_{r2}, . . . , u_{rt}) ∈ H^{mt}, where 1 ≤ r1 < r2 < . . . < rt ≤ w. The fact that the hash functions are constructed from a small set of functions u_r will ensure a faster running time for the algorithm.

For a string S of length n and a function g ∈ H^q projecting strings onto positions 1 ≤ a1 ≤ a2 ≤ . . . ≤ aq ≤ n, we define g(S) as S[a1] S[a2] . . . S[aq]. Consider the set of all suffixes S1, S2, . . . , S2n of T1 and T2. We append each suffix in the set with an appropriate number of letters $ ∉ Σ so that all suffixes have length n and build a (compact) trie on the strings h(S1), h(S2), . . . , h(S2n). Recall that a compact trie stores only explicit nodes, that is, the root, the leaves, and nodes with at least two children. Its size is linear in the number of strings that are stored.

Theorem 3. After O(w n^{4/3} log^{4/3} n)-time and O(wn)-space preprocessing of the functions u_r, r = 1, 2, . . . , w, for any hash function h = (u_{r1}, u_{r2}, . . . , u_{rt}) a trie on h(S1), h(S2), . . . , h(S2n) can be built in O(tn log n) time and O(n) space.

Algorithm 1 Longest common substring with approximately k mismatches.
1: Preprocess T1, T2 for LCPk̃ queries
2: for i = 1, 2, . . . , Θ(t! log n) do
3:   for r = 1, 2, . . . , w do
4:     Choose a function u_r ∈ H^m uniformly at random
5:     Preprocess u_r
6:   end for
7:   for all h = (u_{r1}, u_{r2}, . . . , u_{rt}) do
8:     Build a trie on h(S1), h(S2), . . . , h(S2n)
9:     Augment the trie with an LCA data structure
10:  end for
11:  for all suffixes S of T1 do
12:    Find the largest ℓ such that the total size of the ℓ-neighbourhoods of S is ≥ 2$\binom{w}{t}$
13:    Select a multiset S of 2$\binom{w}{t}$ suffixes from the ℓ-neighbourhoods of S
14:    for all suffixes S′ ∈ S do
15:      Compute LCPk̃(S, S′) and update the answer
16:    end for
17:  end for
18: end for

Let us defer the proof of the theorem until we complete the description of the algorithm and show Theorem 2. The algorithm preprocesses u1, u2, . . . , uw, and for each hash function h builds a trie on h(S1), h(S2), . . . , h(S2n). It then augments the trie with an LCA data structure, which can be done in linear time and space [13, 19]. Given two strings h(Si), h(Sj), the LCA data structure can find the length of their longest common prefix in constant time.

Consider a suffix S of T1 and let h be a hash function projecting a string of length n onto a subset P ⊆ [1, n] of its positions. We define h_[ℓ] to be the projection onto the subset P ∩ [1, ℓ] of positions. (In other words, h_[ℓ] is the function h applied to a prefix of a string of length ℓ.) We further say that the ℓ-neighbourhood of S is the set of all suffixes S′ of T2 such that h_[ℓ](S) = h_[ℓ](S′). Note that for a fixed h and ℓ the size of the ℓ-neighbourhood of S is equal to the number of suffixes S′ of T2 such that the length of the longest common prefix of h(S) and h(S′) is at least |P ∩ [1, ℓ]|. From the properties of the lexicographic order it follows that if S′, S′′ are two suffixes of T2 and h(S′′) is located between h(S′) and h(S) in the trie for h, then the longest common prefix of h(S′) and h(S) is no longer than the longest common prefix of h(S′′) and h(S). Therefore, the larger ℓ is, the smaller the neighbourhood is. We use a simple binary search and the LCA data structures to find the largest ℓ such that the total size of the ℓ-neighbourhoods of S for all hash functions is at least 2$\binom{w}{t}$ in O($\binom{w}{t}$ · log^2 n) time. From the union of the ℓ-neighbourhoods we select a multiset S of 2$\binom{w}{t}$ suffixes, ensuring that all suffixes S′ such that the longest common prefix of h(S′) and h(S) has length at least |P ∩ [1, ℓ]| + 1 are included. For each suffix S′ ∈ S we compute the longest common prefix with approximately k mismatches of S′ and S by one LCPk̃ query (Lemma 2). The longest of all retrieved prefixes, over all suffixes S of T1, is returned as the answer. The algorithm is summarised in Algorithm 1 above. We will now proceed to its complexity and correctness.
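To convey the ordering idea without the trie machinery, the following sketch draws one projection u and sorts the padded suffixes by their projected images. In the algorithm proper this ordering is realised by a compact trie augmented with an LCA structure, and t projections are combined into each hash function h; the plain sort below is only a simplified stand-in.

```python
import random

def random_projection(n, m, rng):
    """One function u from H^m: m independent uniformly random single positions
    of a length-n string, kept sorted."""
    return sorted(rng.choices(range(n), k=m))

def project(s, positions):
    return "".join(s[p] for p in positions)

def order_suffixes(t1, t2, positions):
    """Pad every suffix of T1 and T2 to length n with '$' and sort the suffixes
    by their projected images (a plain-sort stand-in for the trie of Theorem 3)."""
    n = max(len(t1), len(t2))
    padded = [(name, i, (text[i:] + "$" * n)[:n])
              for name, text in (("T1", t1), ("T2", t2)) for i in range(len(text))]
    return sorted(padded, key=lambda rec: project(rec[2], positions))

rng = random.Random(1)
u = random_projection(8, 3, rng)
print(order_suffixes("abracada", "cadabrab", u)[:3])
```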

3.4 Complexity and correctness

To ensure the complexity bounds and correctness of the algorithm, we must carefully choose the parameters t, w, and m. Let p1 = 1 − k/n and p2 = 1 − (1 + ε) · k/n. The intuition behind p1 and p2 is that if S1, S2 are two suffixes appended to length n and the Hamming distance between S1 and S2 is at most k, then p1 is a lower bound for the probability of two letters S1[i], S2[i] to be equal. On the other hand, p2 is an upper bound for the probability of two letters S1[i], S2[i] to be equal if the Hamming distance between S1 and S2 is at least (1 + ε) · k. Let ρ = log p1 / log p2, and define

t = ⌈√(ρ log n / ln log n)⌉ + 1,   w = ⌈n^{ρ/t}⌉,   and   m = ⌈(1/t) · log_{p2}(1/n^2)⌉.

We assume that n ≥ 7 (so that ln log n ≥ 1) and (1 + ε)k < n (so that p1, p2 > 0).

3.4.1 Complexity

To show the complexity of the algorithm, we will start from the following simple observation.

Observation 3. ρ ≤ 1/(1 + ε) and 2 ≤ t ≤ ⌈√(log n)⌉.

Proof. By Bernoulli's inequality, (1 − k/n)^{1+ε} ≥ 1 − (1 + ε) · k/n. Hence, we obtain that ρ = log(1 − k/n) / log(1 − (1 + ε)k/n) ≤ 1/(1 + ε). The second part of the observation follows.

Lemma 5. One step of the algorithm takes O(w n^{4/3} log^{4/3} n + $\binom{w}{t}$ · n log^2 n) time.

Proof. By Theorem 3, after O(w n^{4/3} log^{4/3} n)-time preprocessing we can build a trie and an LCA data structure on the strings h(S1), h(S2), . . . , h(S2n) for a hash function h in O(tn log n) = O(n log^{3/2} n) time, and there are $\binom{w}{t}$ hash functions in total. For each suffix S of T1 we compute the set S in O($\binom{w}{t}$ log^2 n) time and then run 2$\binom{w}{t}$ LCPk̃ queries, which takes O($\binom{w}{t}$ log^2 n) time as well by Lemma 2. Over all suffixes, this takes O($\binom{w}{t}$ · n log^2 n) time.

Corollary 4. The running time of the algorithm is O(n^{1+1/(1+ε)} log^2 n).

Proof. Preprocessing T1, T2 for LCPk̃ queries takes O(n log^3 n) time (see Lemma 2). Each step of the algorithm takes O(w n^{4/3} log^{4/3} n + $\binom{w}{t}$ · n log^2 n) time, and there are Θ(t! log n) steps in total. To estimate the total running time of the algorithm we notice that w · t! = O(e^{√(ρ log n · ln log n)}) = n^{o(1)} and $\binom{w}{t}$ · t! = w(w − 1) · · · (w − t + 1) ≤ (w − 1)^t < n^ρ ≤ n^{1/(1+ε)} for t ≥ 3, whereas for t = 2 we obviously have $\binom{w}{t}$ · t! = O(n^ρ) = O(n^{1/(1+ε)}). Plugging these inequalities into the time bound for one step and recalling that 0 < ε < 2, we obtain the claim.

Lemma 6. The space complexity of the algorithm is O(n^{1+1/(1+ε)}).

Proof. The data structure for LCPk̃ queries requires O(n log^2 n) space. At each step of the algorithm, preprocessing the functions u_r requires O(wn) = O(n^{1+o(1)}) space and the tries occupy O($\binom{w}{t}$ · n) = O(n^{1+1/(1+ε)}) space.

3.4.2 Correctness

Let S be a suffix of T1. Consider the set of the longest common prefixes with k mismatches of S and suffixes of T2, and let ℓk be the maximal length of a prefix in this set, achieved on some suffix S′ of T2.

Lemma 7. For each step, with probability ≥ Θ(1/t!) there exists a hash function h such that h_[ℓk](S) = h_[ℓk](S′).

Proof. Consider the strings S[1, ℓk]$^{n−ℓk} and S′[1, ℓk]$^{n−ℓk}. The Hamming distance between them is equal to the Hamming distance between S[1, ℓk] and S′[1, ℓk], which is k. Moreover, for any hash function h we have that h(S[1, ℓk]$^{n−ℓk}) = h(S′[1, ℓk]$^{n−ℓk}) if and only if h_[ℓk](S) = h_[ℓk](S′). Remember that each hash function is a t-tuple of functions u_r. Consequently, if h(S[1, ℓk]$^{n−ℓk}) ≠ h(S′[1, ℓk]$^{n−ℓk}) for all hash functions h, the strings collide on at most t − 1 functions u_r. By [5, Lemma A.1] the probability of this event for two strings at Hamming distance k is at most 1 − Θ(1/t!).


As a corollary, we can choose the constant in the number of steps so that with probability ≥ 1 − 1/n^2 there will exist a step of the algorithm such that for at least one hash function we will have h_[ℓk](S) = h_[ℓk](S′). The set S of 2$\binom{w}{t}$ suffixes that we sample from the ℓ-neighbourhoods of S might or might not include the suffix S′. If it does, then the LCPk̃ query for S′ and S will return a substring of length ≥ ℓk with high probability. If it does not, then by the definition of neighbourhoods, for each suffix S′′ ∈ S belonging to the neighbourhood for a hash function g we have g_[ℓk](S) = g_[ℓk](S′′). We will show that only for a small number of such suffixes an LCPk̃ query can return a substring of length smaller than ℓk.

Lemma 8. With probability ≥ 1 − 2/n^2 there are at most $\binom{w}{t}$ suffixes S′′ such that g_[ℓk](S) = g_[ℓk](S′′) but the LCPk̃(S, S′′) query returns a substring shorter than ℓk.

Proof. If the LCPk̃ query for S and S′′ returns a substring shorter than ℓk, then with probability at least 1 − 1/n^2 the Hamming distance between S[1, ℓk] and S′′[1, ℓk] is at least (1 + ε) · k. Remember that a hash function g can be considered as a projection onto a random subset of positions of size mt, and therefore we obtain

Pr[ g_[ℓk](S) = g_[ℓk](S′′) ] = Pr[ g(S[1, ℓk]$^{n−ℓk}) = g(S′′[1, ℓk]$^{n−ℓk}) ] ≤ (p2)^{mt},

where the latter is at most 1/n^2 since mt ≥ log_{p2}(1/n^2). We can consider an indicator random variable equal to one if for a suffix S′′ such that g_[ℓk](S) = g_[ℓk](S′′) the LCPk̃ query returns a substring shorter than ℓk, and to zero otherwise. By linearity, the expectation of their sum is at most (2/n^2) · $\binom{w}{t}$. The claim follows from Markov's inequality.

From Lemmas 7, 8 and 2 it follows that the algorithm correctly finds the value of ℓ for the suffix S of T1 with error probability ≤ 3/n^2. Applying the union bound, we obtain that the error probability of the algorithm is constant.

3.5 Proof of Theorem 3

Recall that h is a t-tuple of functions u_r, i.e. h = (u_{r1}, u_{r2}, . . . , u_{rt}), where 1 ≤ r1 < r2 < . . . < rt ≤ w. Below we will show that after the preprocessing of the functions u_r we will be able to compute the longest common prefix of any two strings u_r(Si), u_r(Sj) in O(1) time. As a result, we will be able to compute the longest common prefix of h(Si), h(Sj) in O(t) time. It also follows that we will be able to compare any two strings h(Si), h(Sj) in O(t) time, as their order is defined by the letter following the longest common prefix. Therefore, we can sort the strings h(S1), h(S2), . . . , h(S2n) in O(tn log n) time and O(n) space and then compute the longest common prefix of each two adjacent strings in O(tn) time. The trie on h(S1), h(S2), . . . , h(S2n) can then be built in O(n) time by imitating its depth-first traversal.

It remains to explain how we preprocess the functions u_r, r = 1, 2, . . . , w. For each function u_r it suffices to build a trie on the strings u_r(S1), u_r(S2), . . . , u_r(S2n) and to augment it with an LCA data structure [13, 19]. We will consider two different methods for constructing the trie, with time dependent on m. No matter what the value of m is, one of these methods will have O(n^{4/3} log^{4/3} n) running time. Let u_r be a projection onto a subset P of positions 1 ≤ a1 ≤ a2 ≤ . . . ≤ am ≤ n and denote T = T1 $^n T2 $^n.

Lemma 9. The trie on u_r(S1), u_r(S2), . . . , u_r(S2n) can be constructed in O(√m · n log n) time and O(n) space correctly with error probability at most 1/n^3.

Proof. W.l.o.g. assume that √m is an integer. We partition P into disjoint subsets B1, B2, . . . , B_{√m}, where B_ℓ = {a_{ℓ,1}, a_{ℓ,2}, . . . , a_{ℓ,√m}} = {a_{(ℓ−1)√m+q} | q ∈ [1, √m]}. Now u_r can be represented as a √m-tuple of projections b1, b2, . . . , b_{√m} onto the subsets B1, B2, . . . , B_{√m}, respectively. We will build the trie by layers to avoid space overhead. Suppose that we have built the trie for a function (b1, b2, . . . , b_{ℓ−1}) and we want to extend it to the trie for (b1, b2, . . . , b_{ℓ−1}, b_ℓ).

Let p be a random prime of value Θ(n^5). We create a vector χ of length n, where χ[a_{ℓ,q}] = 2^{√m − q} and zero for all positions not in B_ℓ. We then run the FFT algorithm for χ and T in the field Z_p [14]. The output of the FFT algorithm will contain the convolutions of χ and all suffixes S1, S2, . . . , S2n. The convolution of χ and a suffix Si is the Karp-Rabin fingerprint [23] ϕ_{ℓ,i} of b_ℓ(Si), where

ϕ_{ℓ,i} = Σ_{q=1}^{√m} Si[a_{ℓ,q}] · 2^{√m − q}  (mod p).

If the fingerprints of b_ℓ(Si) and b_ℓ(Sj) are equal, then b_ℓ(Si) and b_ℓ(Sj) are equal with probability at least 1 − 1/n^5, and otherwise they differ. For a fixed leaf of the trie for (b1, b2, . . . , b_{ℓ−1}) we sort all suffixes that end in it by the fingerprints ϕ_{ℓ,i}, which takes O(n log n) time in total. For each two suffixes Si, Sj that end in the same leaf, are adjacent and have ϕ_{ℓ,i} ≠ ϕ_{ℓ,j}, we compare b_ℓ(Si) and b_ℓ(Sj) letter-by-letter in O(√m) time to find their longest common prefix. Note that this letter-by-letter comparison step will be executed at most n times, and therefore will take O(√m · n) time in total. We then append each leaf with a trie on the strings b_ℓ(Si) that can be built by imitating its depth-first traversal, which takes O(n) time for a layer. By the union bound, the error probability is at most (1/n^5) · n · √m ≤ 1/n^3.

The second method builds the trie using the algorithm described in the first paragraph of this section; we only need to give a method for computing the longest common prefix of u_r(Si) and u_r(Sj) (or, equivalently, the first position where u_r(Si) and u_r(Sj) differ). The following lemma shows that this query can be answered in O(n log n/m) time, which gives O(n^2 log^2 n/m) time complexity for the trie construction.

Lemma 10 ([5]). After O(n)-time and space preprocessing, the first position where two strings u_r(Si) and u_r(Sj) differ can be found in O(n log n/m) time correctly with error probability at most 1/n^3.

Proof. For m = O(log n) the conclusion is trivial. Assume otherwise. We start by building the suffix tree for the string T, which takes O(n) time and space [30, 18]. Furthermore, we augment the suffix tree with an LCA data structure in O(n) time [13, 19]. Let ℓ = ⌈3n ln n/m⌉. We can find the first ℓ positions q1 < q2 < . . . < qℓ where Si and Sj differ in O(ℓ) = O(n log n/m) time using the kangaroo method [26, 16]. We set qi = ∞ if a given position does not exist. The idea of the kangaroo method is as follows. We can find q1 by one query to the LCA data structure in O(1) time. After removing the first q1 positions of Si, Sj, we obtain suffixes S_{i+q1}, S_{j+q1} and find q2 by another query to the LCA data structure, and so on. If at least one of the positions q1, q2, . . . , qℓ belongs to P, then we return the first such position as the answer, and otherwise we report that u_r(Si) = u_r(Sj). The set P can be stored in a Boolean array so that testing if an element belongs to it can be done in constant time.

Let us show that if p is the first position where u_r(Si) and u_r(Sj) differ, then p belongs to {q1, q2, . . . , qℓ} with high probability. Because q1 < q2 < . . . < qℓ are the first ℓ positions where Si and Sj differ, it suffices to show that at least one of these positions belongs to P. We rely on the fact that P is a random subset of [1, n]. We have

Pr[q1, . . . , qℓ ∉ P] = (1 − ℓ/n)^m ≤ (1 − 3 ln n/m)^m ≤ 1/e^{3 ln n} = 1/n^3.

As a corollary of Lemmas 9 and 10, the trie on the strings u_r(S1), . . . , u_r(S2n) can be built in O(min{√m, n log n/m} · n log n) = O(n^{4/3} log^{4/3} n) time and O(n) space correctly with high probability, which implies Theorem 3 as explained in the beginning of this section.
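For illustration, the fingerprint ϕ of a single block projection can be computed directly. The paper obtains all of these values at once with one FFT over Z_p per block and a random Θ(n^5) prime; the fixed prime, the treatment of positions past the end of the text as 0-valued '$' padding, and the binary text are assumptions of this sketch.

```python
def block_fingerprint(text, start, block_offsets, prime):
    """Karp-Rabin-style fingerprint of b_l(S_i): read the projected letters of the
    suffix text[start:] as binary digits, most significant first, modulo a prime.
    block_offsets are the positions a_{l,1} < ... < a_{l,sqrt(m)}, relative to the
    suffix start; positions past the end of the text count as 0 ('$' padding)."""
    value = 0
    for offset in block_offsets:
        bit = int(text[start + offset]) if start + offset < len(text) else 0
        value = (2 * value + bit) % prime
    return value

# Two suffixes whose projections agree get equal fingerprints:
assert block_fingerprint("0101100", 1, [0, 2, 3], 10**9 + 7) == \
       block_fingerprint("0111111", 1, [0, 2, 3], 10**9 + 7)
```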

4 Approximate Longest Common Substring with k Mismatches

In this section we consider an approximate variant of the Longest Common Substring with k Mismatches problem, defined as follows.

Problem 5 (Approximate Longest Common Substring with k Mismatches). Two strings T1, T2 of length n, an integer k, and a constant z > 1 are given. If ℓk is the length of the longest common substring with k mismatches of T1 and T2, return a substring of T1 of length at least ℓk/z that occurs in T2 with at most k mismatches.

Theorem 5. (a) Approximate Longest Common Substring with k Mismatches can be solved for z = 2 and a binary alphabet in O(n^{1.5} log^2 n) time and O(n^{1.5}) space.

(b) Suppose there exists 0 < ε < 1 such that Approximate Longest Common Substring with k Mismatches can be solved for z = 2 − ε in O(n^{2−δ}) time, for some δ > 0. Then SETH is false.

Proof. (a) The algorithm of Theorem 2 for ε = 1 computes a pair of substrings of T1 and T2 of length at least ℓk that have Hamming distance at most 2k. Either the first halves or the second halves of the strings have Hamming distance at most k.

(b) We use the gap that exists in Lemma 1 for q > 1. Assume that there is such an algorithm for some ε and δ. We will run it for the strings T1 and T2 from that lemma. Let q = ⌈3/ε⌉ − 2; then ℓ1/ℓ2 ≥ 2 − ε. If the Orthogonal Vectors problem has a solution, by Lemma 1(a), the algorithm produces a longest common substring of length at least ℓ1/(2 − ε) ≥ ℓ2. Otherwise, by Lemma 1(b), its result has length smaller than ℓ2. Hence, the conjectured approximation algorithm can be used to solve the Orthogonal Vectors problem. The lengths of the selected strings are n = N(7dq + 7d) + 7dq = O(Nd) for d = c log N. Hence, the running time is O(n^{2−δ}) = O(N^{2−δ} d^{O(1)}), which, by Fact 1, contradicts SETH.
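The halving step in part (a) is just the pigeonhole principle: if two equal-length strings disagree in at most 2k positions, at least one of the two aligned halves disagrees in at most k positions. A minimal sketch:

```python
def half_with_k_mismatches(u, v, k):
    """Given |u| == |v| and dH(u, v) <= 2k, return aligned halves of u and v whose
    Hamming distance is at most k (pigeonhole over the two halves)."""
    assert len(u) == len(v)
    mid = len(u) // 2
    if sum(a != b for a, b in zip(u[:mid], v[:mid])) <= k:
        return u[:mid], v[:mid]
    return u[mid:], v[mid:]      # otherwise the right halves carry <= k mismatches
```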

5 Longest Common Substring with k Mismatches for all k

The following problem has received considerable attention in recent years; see [12] and the references therein.

Problem 6 (Binary Jumbled Indexing). Construct a data structure over a binary string S of length n that, given positive integers ℓ and q, can decide if there is a substring of S of length ℓ containing exactly q ones.

A simple combinatorial argument shows that it suffices to compute the minimal and maximal number of ones in a substring of S of length ℓ, as for every intermediate number of ones a substring of S of this length exists as well. This shows that the Binary Jumbled Indexing problem can be solved in linear space and with constant-time queries. It turns out that the index can also be constructed in strongly subquadratic time.

Lemma 11 ([12]). The index for Binary Jumbled Indexing of O(n) size and with O(1)-time queries can be constructed in O(n^{1.859}) time.

We use this result to solve the Longest Common Substring with k Mismatches problem for all values of k simultaneously.

Theorem 6. Longest Common Substring with k Mismatches for all k can be solved in O(n^{2.859}) time.

Proof. Note that, equivalently, we can compute, for all ℓ = 1, . . . , n, the minimal Hamming distance between substrings of length ℓ in T1 and T2. Let M be an n × n Boolean matrix such that M[i, j] = 0 if and only if T1[i] = T2[j]. We construct 2n − 1 binary strings corresponding to the diagonals of M: the string number p, for p ∈ {−(n − 1), . . . , n − 1}, corresponds to the diagonal M[i, j] : j − i = p. For each of the strings we construct the jumbled index using Lemma 11. Each diagonal corresponds to one of the possible alignments of T1 and T2. In the jumbled index we compute, in particular, for each value of ℓ the minimal number of 1s (which correspond to mismatches between the corresponding positions of T1 and T2) in a string of length ℓ. To compute the global minimum for a given ℓ we only need to take the minimum over all the jumbled indexes. By Lemma 11, all the jumbled indexes can be constructed in O(n^{2.859}) time.
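The simple index behind Problem 6 and the diagonal strings used in the proof of Theorem 6 can be sketched as follows. This is a quadratic-time construction for illustration only; the O(n^{1.859}) construction of Lemma 11 is far more involved and is not attempted here.

```python
def jumbled_index(s):
    """For each window length l of the binary string s, record the minimum and
    maximum number of ones over all length-l windows; a window with exactly q ones
    exists iff min_ones <= q <= max_ones."""
    n = len(s)
    prefix = [0] * (n + 1)
    for i, c in enumerate(s):
        prefix[i + 1] = prefix[i] + (c == "1")
    return {length: (min(prefix[i + length] - prefix[i] for i in range(n - length + 1)),
                     max(prefix[i + length] - prefix[i] for i in range(n - length + 1)))
            for length in range(1, n + 1)}

def diagonal_strings(t1, t2):
    """The 2n-1 binary strings of the mismatch matrix M (M[i][j] = 1 iff T1[i] != T2[j]),
    one per diagonal j - i = p; the minimum number of ones over length-l windows of the
    diagonals is the minimum Hamming distance over length-l substrings of T1 and T2."""
    n = len(t1)
    return ["".join("0" if t1[i] == t2[i + p] else "1"
                    for i in range(max(0, -p), min(n, n - p)))
            for p in range(-(n - 1), n)]

indexes = [jumbled_index(diag) for diag in diagonal_strings("abcab", "abbab")]
best_for_length_3 = min(idx[3][0] for idx in indexes if 3 in idx)  # min mismatches over length-3 alignments
```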


References

[1] A. Abboud, R. Williams, and H. Yu. More applications of the polynomial method to algorithm design. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 218–230, 2015.
[2] N. Alon and J. Spencer. The Probabilistic Method. John Wiley and Sons, 1992.
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. J. Lipman. Basic local alignment search tool. Molecular Biology, 215(3):403–410, 1990.
[4] S. Aluru, A. Apostolico, and S. Thankachan. Efficient alignment free sequence comparison with bounded mismatches. In Proceedings of the 19th Annual International Conference on Research in Computational Molecular Biology, pages 1–12, 2015.
[5] A. Andoni and P. Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1203–1212, 2006.
[6] M. Babenko and T. Starikovskaya. Computing longest common substrings via suffix arrays. In Proceedings of the 3rd International Computer Science Symposium in Russia, volume 5010 of Lecture Notes in Computer Science, pages 64–75, 2008.
[7] M. Babenko and T. Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28–33, 2011.
[8] A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing (STOC), pages 51–58, 2015.
[9] P. Bille, I. L. Gørtz, and J. Kristensen. Longest common extensions via fingerprinting. In Proceedings of the 6th International Conference on Language and Automata Theory and Applications, pages 119–130, 2012.
[10] P. Bille, I. L. Gørtz, B. Sach, and H. W. Vildhøj. Time-space trade-offs for longest common extensions. Journal of Discrete Algorithms, 25:42–50, 2014.
[11] K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 79–97, 2015.
[12] T. M. Chan and M. Lewenstein. Clustered integer 3SUM via additive combinatorics. In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing (STOC), pages 31–40, 2015.
[13] J. Fischer and V. Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Conference on Combinatorial Pattern Matching, pages 36–48, 2006.
[14] M. J. Fischer and M. S. Paterson. String-matching and other products. In Complexity of Computation, volume 7, pages 113–125. SIAM AMS, 1974.
[15] T. Flouri, E. Giaquinta, K. Kobert, and E. Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6–8):643–647, 2015.
[16] Z. Galil and R. Giancarlo. Parallel string matching with k mismatches. Theoretical Computer Science, 51(3):341–348, 1987.
[17] S. Grabowski. A note on the longest common substring with k-mismatches problem. Information Processing Letters, 115(6–8):640–642, 2015.

[18] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[19] D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal of Computing, 13(2):338–355, 1984.
[20] L. Hui. Color set size problem with applications to string matching. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pages 230–243, 1992.
[21] L. Ilie, G. Navarro, and L. Tinta. The longest common extension problem revisited and applications to approximate string searching. Journal of Discrete Algorithms, 8(4):418–428, 2010.
[22] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604–613, 1998.
[23] R. Karp and M. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.
[24] T. Kociumaka, T. Starikovskaya, and H. W. Vildhøj. Sublinear space algorithms for the longest common substring problem. In Proceedings of the 22nd Annual European Symposium on Algorithms, volume 8737 of Lecture Notes in Computer Science, pages 605–617, 2014.
[25] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), pages 614–623, 1998.
[26] G. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239–249, 1986.
[27] C. Leimeister and B. Morgenstern. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000–2008, 2014.
[28] T. Starikovskaya and H. W. Vildhøj. Time-space trade-offs for the longest common substring problem. In Proceedings of the 24th Annual Symposium on Combinatorial Pattern Matching, volume 7922 of Lecture Notes in Computer Science, pages 223–234, 2013.
[29] T. A. Starikovskaya. Longest common substring with approximately k mismatches. In 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, LIPIcs, pages 21:1–21:11. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2016.
[30] P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th IEEE Symposium on Foundations of Computer Science (FOCS), pages 1–11, 1973.
[31] R. Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2-3):357–365, 2005.
