Reconstructing Textual Documents from n-grams

Matthias Gallé
Xerox Research Centre Europe
Meylan, France
[email protected]

Matías Tealdi∗
Universidad Nacional de Córdoba
Córdoba, Argentina
[email protected]

∗ Contributed to this work while affiliated with the Xerox Research Centre Europe.

ABSTRACT

We analyze the problem of reconstructing documents when all we have access to are the n-grams, and their counts, from the original documents, for a fixed n. Formally, we are interested in recovering the longest contiguous substrings whose presence in the original documents we can be certain of. We map this problem onto a de Bruijn graph, in which the n-grams form the edges and every Eulerian cycle gives a plausible reconstruction. We define two rules that reduce this graph in such a way as to preserve all possible reconstructions, while at the same time increasing the length of the edge labels. From a theoretical perspective, we prove that the iterative application of these rules gives an irreducible graph equivalent to the original one. We then apply this to data from the Gutenberg Project to measure the number and size of the obtained longest substrings. Moreover, we analyze how the n-gram corpus could be noised to prevent reconstruction, showing empirically that removing low-frequency n-grams has little impact. Instead, we propose another method, consisting of strategically adding fictitious n-grams, and show that a corpus noised in this way is much harder to reconstruct while only slightly increasing the perplexity of a language model obtained from it.
1. INTRODUCTION
Organizations may be interested in releasing part of the data they own for reasons of general good, prestige, harnessing the work of those the data is released to, or because it opens access to new sources (in a marketplace setting, for instance). However, most of the time it is not possible to release the complete data, due to privacy concerns, legal constraints or economic interest. A compromise is to release some statistics or embedding computed over this data. In the case of releasing n-gram counts of text documents (the case we study here), the most prominent example is
the Google Ngram Corpus [12], which provides access to the n-grams and their counts (with n up to 5) of millions of OCR'ed books. This corpus provided a whole new way of analyzing historical trends (called culturomics), but the fact of not having access to longer phrases restricts its uses. For example, a better linguistic analysis requires disambiguation of the phrases, for which longer substrings would be needed. This was the reason for a major update of this corpus including Part-of-Speech tags [11]. Another example is the use and exchange of phrase tables for machine translation when the original parallel corpora are private or confidential [1]. In such a marketplace setting all beneficiaries have to contribute their own statistics, which can be less informative than the original data. The question then arises of how much of the original data to disclose in order to provide useful information to other parties while at the same time avoiding reconstruction of the original documents.

We analyze here the question of what are the longest blocks that can be reconstructed with total certainty starting from an n-gram corpus. A similar problem is solved routinely in DNA sequencing, by mapping the n-grams into a graph and finding an Eulerian cycle in this graph [5]. However, the number of different Eulerian cycles can grow worse than exponentially with the number of nodes, and only one of these cycles corresponds to the original document. We present a novel reduction of this de Bruijn graph into an irreducible form, from which large blocks of substrings of the document can easily be read off.

In the next section (Sect. 2) we provide more context about the problem we are solving, and review related work in Sect. 3. In Sect. 4 we give the definitions needed to define our reduction steps (Sect. 5), and we show the results of our experiments on single books and groups of books in Sect. 6. In Sect. 7 we study the impact of removing low-frequency n-grams, both on the reconstruction method and on the utility of the corpus (measured as the perplexity of the obtained language model). Finally, we propose another noising method, one that adds fictitious n-grams instead of removing existing ones.

The main contributions of this paper are:

1. Theorem 2, which states that the application of two reduction steps on the de Bruijn graph results in maximal substrings of the original data (their presence is certain, and that of any longer string is not);

2. empirical results showing that applying these steps on real data yields very long strings (see Sect. 6; the numbers are influenced by n and the number of books), and that this holds true even if low-frequency n-grams are removed;

3. a novel method to add noise to an n-gram corpus that is more resistant to reconstruction approaches, while providing a better language model than removing n-grams.
2. DE BRUIJN GRAPHS AND EULERIAN CYCLES
We index strings starting from 1 (s[i] = the i-th symbol of s) and denote concatenation of strings by . and slicing by : (s[i : j] = s[i] . . . s[j]). We assume that the only information we have is the list of n-grams (with n fixed) of the original document, together with their respective occurrence counts. As a running example we will use the following variation of Gertrude Stein's famous quote: "$ a rose rose is a rose is a rose #". This sequence is hidden to us and, supposing n = 2, we only have access to the following table:

n-gram      count
$ a         1
a rose      3
rose rose   1
rose is     2
is a        2
rose #      1
Note that there are two other sequences that give rise to exactly the same set of bigrams and which are therefore indistinguishable given only this table ("$ a rose is a rose rose is a rose #" and "$ a rose is a rose is a rose rose #"). For now we suppose that the information we have is complete (no n-gram is missing) and correct (no noise, and the counts are exact). Later on we will relax these assumptions. The problem of finding a possible document that generated such a table can be cast as the following graph problem. Create one node for each occurrence of each n-gram, and add an edge between nodes v and w if v[2 : n] = w[1 : n − 1], that is, if the suffix of v is the same as the prefix of w. The problem can then be resolved by finding a Hamiltonian path (a path that visits each node exactly once). Concatenating the starting node and the last symbols of all the other nodes – in the order the path visits them – reconstructs one possible document that may have generated the n-gram corpus. Unfortunately, the Hamiltonian path problem is a known NP-hard problem [9]. However, it has long been known that casting the problem differently leads to a polynomial solution. This formulation is also graph based, although the resulting graph is a multigraph where each edge has an associated multiplicity specifying the number of times it is repeated. The set of nodes is the set of all different (n − 1)-grams of the document. There will be k edges between nodes u and v if u.v[n − 1] = u[1].v and this n-gram appears k times in the original document. The problem of finding a document that generated this set of n-grams is then equivalent to finding an Eulerian path in this graph, for which well-known linear-time algorithms exist [6]. Fig. 1a shows this graph for our example sentence. For simplicity we call such a graph a de Bruijn graph, although the original use [7] gives a much stricter definition.
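To make the construction concrete, the following minimal Python sketch builds the multigraph from the bigram counts of the running example and extracts one Eulerian path with Hierholzer's algorithm. It only illustrates the formulation above (it is not the implementation used in our experiments), and all names are ours:

from collections import defaultdict

# bigram counts of the running example (n = 2); nodes are (n-1)-grams
ngram_counts = {
    ("$", "a"): 1, ("a", "rose"): 3, ("rose", "rose"): 1,
    ("rose", "is"): 2, ("is", "a"): 2, ("rose", "#"): 1,
}

def build_de_bruijn(counts):
    """One edge per n-gram occurrence, between its (n-1)-gram prefix and suffix."""
    graph = defaultdict(list)
    for ngram, k in counts.items():
        u, v = ngram[:-1], ngram[1:]
        graph[u].extend([v] * k)   # multiplicity k = k parallel edges
    return graph

def eulerian_path(graph, start):
    """Hierholzer's algorithm: visits every edge exactly once, in linear time."""
    adj = {u: list(vs) for u, vs in graph.items()}  # mutable local copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj.get(u):
            stack.append(adj[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

g = build_de_bruijn(ngram_counts)
path = eulerian_path(g, ("$",))
print(" ".join(node[-1] for node in path))
# prints one of the three indistinguishable reconstructions, e.g.
# "$ a rose is a rose is a rose rose #"

Which of the indistinguishable reconstructions is returned depends only on the order in which parallel edges are popped.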
(a) Original de Bruijn graph. (b) Reduced to an irreducible form.
Figure 1: An example of the reduction on the sequence "$ a rose rose is a rose is a rose #".
Alas, the problem of reconstructing the original document is not resolved by this. There may be more than one Eulerian path for a given graph, and the exact number of Eulerian cycles in a directed graph is given by the BEST theorem, which states that this number grows in proportion to the factorials of the node degrees:

Theorem 1 (BEST Theorem). Given an Eulerian graph G = (V, E), the number of different Eulerian cycles is

|ec(G)| = T_w(G) · ∏_{v∈V} (d(v) − 1)!

where T_w(G) is the number of trees directed towards the root at a fixed node w in G.

Each one of these cycles gives a document which is a plausible source of the table of n-grams. Our goal is to find the maximal contiguous blocks whose presence in the original text we are certain of, given the evidence. In terms of the graph representation, this corresponds to finding sub-paths that occur in every possible Eulerian cycle. In this case, these would be the overlapping blocks: "rose is a rose"; "$ a rose"; "rose rose"; and "rose #". In order to obtain this set of maximal blocks we are going to reduce the graph to an irreducible form, where each edge corresponds to one of these blocks. The formal definition will be given afterwards, but Fig. 1b shows an example of such an irreducible graph for our example. The three different Eulerian paths of this final graph (as defined by their labels) are precisely the three different possible original sequences.
3. RELATED WORK
Trying to reconstruct a sequence from its set of n-grams is a standard problem in bioinformatics, where the search for an Eulerian path in a de Bruijn graph is one of the big success stories of graph theory [5]. This formulation was given and popularized by Pevzner [15, 14]. However, the nature of the application there poses different challenges from the one considered here. The goal there is to reconstruct a genome de novo from the samples obtained through sequencing techniques, which generate reads (currently around 200 nucleotides long) of DNA sequences from tissue. These reads are error-prone, and the standard procedure is to extract n-grams (with typically n = 20) and to construct the de Bruijn graph out of these. In most algorithms these are binary graphs, where the actual number of occurrences of a given n-gram is ignored. Stretches of simple paths, of nodes where d_in(x) = d_out(x) = 1, are collapsed to form unitigs [3], and only then are heuristics used to create the contigs, which correspond to our maximal blocks. Ordering these contigs is known as scaffolding. Because the input may be noisy, several techniques have been developed to detect these errors using the de Bruijn graph (see [13] for a recent review of tools).

Applied to textual data, the only similar work we are aware of is [10], where a very basic approach was used, equivalent to extracting the unitigs mentioned before. Surprisingly, what has been studied in the past is the reconstruction of bigrams starting from the counts of unigrams only [17, 8], an arguably more difficult problem. A solution is made possible by assuming that the counts are given separately for each document (instead of aggregating them, as we suppose here). Also, the EM algorithm used as optimization procedure has trouble scaling up to documents longer than single sentences. [2] assumes that both parties have a de Bruijn graph, and analyzes the cost of encoding the choices that have to be made to transfer to another party the unique reconstruction corresponding to the string to be transferred.

The closest result to Theorem 2 is a result by Pevzner [15] showing that |ec(G)| = 1 if and only if the intersection graph of simple cycles of G is a tree. Nothing more is said about what happens when there is more than one Eulerian cycle. Note also that our definition of irreducibility is not the same as uniqueness of Eulerian cycles, as several such cycles may use different edges with the same label and therefore still result in the same reconstruction.
4. DEFINITIONS

As can be seen from Fig. 1, we will be working with directed multigraphs, where an edge not only has a multiplicity attached to it, but also a label denoting the substring it represents. This motivates the following definition of graph:

Definition 1. A graph G is a tuple (V, E), with V the set of nodes and E the set of edges, where each edge is of the form (⟨u, v, ℓ⟩, k) with u, v ∈ V, ℓ ∈ Σ∗, k ∈ ℕ; here Σ is the vocabulary of the original sequence. Given an edge e = (⟨v, w, ℓ⟩, k) we use the following terms to refer to its components: tail(e) = v, head(e) = w, label(e) = ℓ, multiplicity(e) = k.

The indegree of a node v is

d_in(v) = Σ_{e∈E : head(e)=v} multiplicity(e),

and the outdegree is

d_out(v) = Σ_{e∈E : tail(e)=v} multiplicity(e).

A graph is Eulerian if and only if it is connected and d_in(v) = d_out(v) for all nodes v. In this case we define d(v) = d_in(v) = d_out(v). We also define d̂_in(v) = |{e ∈ E : head(e) = v}| and d̂_out(v) = |{e ∈ E : tail(e) = v}|, the number of different edges (as defined by their labels) entering and leaving v. Note that it is not necessarily the case that d̂_in(v) = d̂_out(v), even for Eulerian graphs. However, we require that the labels determine the edges uniquely; that is, ∀e1, e2 ∈ E, if label(e1) = label(e2), then e1 = e2.

Neighbouring edges of the de Bruijn graph always overlap by n − 1 symbols. We therefore define an overlapping concatenation operator ⋄_n as α ⋄_n β = α.β[n + 1 : |β|]. As an example: "$ a rose" ⋄_1 "rose is a rose" = "$ a rose is a rose". If n is clear from the context, we just write ⋄. For adding or removing edges we will sometimes use the notation E ∪ {(⟨v, w, ℓ⟩, k)}, which means that the edge gets added to the set with multiplicity k if it is not yet present, or that its multiplicity is increased by k otherwise.

An Eulerian cycle in such a graph is then a cycle that visits each edge e exactly multiplicity(e) times. We denote the set of all Eulerian cycles of G by ec(G). Given an Eulerian cycle c = e1, . . . , en, its label sequence is the list ℓ(c) = [label(e1), . . . , label(en)], and the string it represents is the concatenation of these labels: s(c) = label(e1) ⋄ label(e2) ⋄ . . . ⋄ label(en). Given a de Bruijn graph G constructed from a document d, there is one Eulerian cycle c ∈ ec(G) such that s(c) = d.

We focus here on the problem of finding substrings, as long as possible, of which we are sure that they appear in the original sequence, given the evidence of the n-grams. Operating on the graph, we want to find another – more reduced – graph that has exactly the same Eulerian cycles (as given by the operator s defined before) and that cannot be reduced further. Formally, given the original graph G, we are interested in a graph G∗ that:

1. is equivalent to G:
{s(c) : c ∈ ec(G)} = {s(c) : c ∈ ec(G∗)}    (1)

2. is irreducible:
∄ e1, e2 ∈ E∗ : [label(e1), label(e2)] appears in all ℓ(c), c ∈ ec(G∗)    (2)
If a graph were reducible, there would exist two edges e1 and e2 such that every Eulerian cycle traverses them in that order; they could therefore be merged, yielding a more reduced graph that is still equivalent to the original one.
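For concreteness, the following sketch shows one possible encoding of Definition 1 and of the overlapping concatenation operator ⋄; the names are ours, and later sketches reuse these definitions:

from dataclasses import dataclass

# A hypothetical edge record mirroring Definition 1: (<tail, head, label>, multiplicity).
@dataclass
class Edge:
    tail: tuple          # (n-1)-gram node, as a tuple of words
    head: tuple          # (n-1)-gram node
    label: tuple         # the substring this edge represents
    multiplicity: int

def overlap_concat(alpha, beta, n):
    """The operator <>_n: concatenate two labels overlapping in n symbols."""
    assert alpha[len(alpha) - n:] == beta[:n], "labels must overlap by n symbols"
    return alpha + beta[n:]

# "$ a rose" <>_1 "rose is a rose" = "$ a rose is a rose"
print(overlap_concat(("$", "a", "rose"), ("rose", "is", "a", "rose"), 1))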
5. REDUCTION STEPS
Our strategy is to start with the original de Bruijn graph and to iteratively merge edges until no further reduction can be done. That final graph should then (if the reduction steps are complete) be irreducible, and its edges correspond exactly to the maximal strings we want to obtain. These merges will be controlled by two reduction steps which preserve Eulerian cycles (correctness), and whose successive application ensures an irreducible Eulerian graph (completeness).

The first reduction rule takes into account purely local information of a node, that is, the number of incoming and outgoing edges. As an example, consider Fig. 2, showing the complete local context of a node x with incoming edges α, β and γ and outgoing edges δ, ε and ζ. Consider now edge α: because any Eulerian cycle has to traverse each edge e exactly multiplicity(e) times, α will be used 8 times. Even if in 4 of these cases the path continues with ε or ζ, this still leaves 4 times where the only remaining option is to leave through edge δ. Therefore α ⋄ δ has to occur at least 4 times. The general and formal rule is (see also Fig. 3):

Figure 2: Example of application for Rule 1. α, β and γ are incoming edges, while δ, ε and ζ are outgoing edges. Each Eulerian cycle has to pass 10 = d(x) times through x.

Figure 3: General form of the Pigeonhole rule (Rule 1): if there are an incoming edge e1 and an outgoing edge e2 such that a = multiplicity(e1) + multiplicity(e2) − d(x) > 0, then the rule can be applied.

Reduction Rule 1. Let G = (V, E) be an Eulerian graph, x ∈ V with incoming edges (⟨v1, x, ℓ1⟩, p1), . . . , (⟨vn, x, ℓn⟩, pn) and outgoing edges (⟨x, w1, t1⟩, k1), . . . , (⟨x, wm, tm⟩, km). Furthermore suppose that there exist 1 ≤ i ≤ n and 1 ≤ j ≤ m such that pi > d(x) − kj. Then we define the reduced graph G′ = (V′, E′) such that

• E′ = E \ {(⟨vi, x, ℓi⟩, a), (⟨x, wj, tj⟩, a)} ∪ {(⟨vi, wj, ℓi ⋄ tj⟩, a)}, where a = pi − (d(x) − kj);

• if a = d(x) then we set V′ to V \ {x}; if not, it remains unmodified (V′ = V).

We will also call this rule the Pigeonhole rule. Of course, there can be more than one application of the rule for any given node. Note also that this rule generalizes the most basic case, in which a node has only one incoming and one outgoing edge (in which case a = d(x)).

Local information alone, however, is not enough to obtain irreducible graphs. Consider the example of Fig. 4. In this case x divides the graph into two parts (a double triangle to its left and a simple one to its right). Rule 1 cannot be applied on node x; however, each path that goes from the left part to the right part not only has to go through x, but has to do so using the edges α and β. That is so because the only way of entering x from the left is through α, the only way of leaving it to the right is through β and – most importantly – because x is a division point:

Figure 4: Example of application for Rule 2. Each edge, except for α, has multiplicity 1.

Definition 2. A node x is a division point of an Eulerian graph G if there exist nodes v, w (not necessarily different) for which all non-empty paths from v to w, and from w to v, go through x.

This is almost the same as the concept of an articulation point [16], but it also includes self-loops (because v may be equal to w). As with articulation points, division points can be retrieved using a linear algorithm very similar to Tarjan's. Removing any division point splits the graph into connected components that are themselves Eulerian graphs. We define this formally:

Definition 3. If x is a division point of an Eulerian graph G = (V, E) then we say that it generates graphs Gi, where there is one Gi for each:

• self-loop e = (⟨x, x, ℓ⟩, k). In this case Gi = ({x}, {e}).

• connected component that remains after eliminating x. In this case, Vi will be the nodes of this component, plus x, and Ei = {e : head(e), tail(e) ∈ Vi, e is not a self-loop over x}.

It is easy to see that each component Gi contains an Eulerian cycle. For the self-loops this is trivially true. The other components are connected by definition, and the degrees of nodes other than x are not modified, so they are balanced. It remains to see that the copy of x in each graph remains balanced. Because G had an Eulerian cycle, the only way of crossing components is through x, and each circuit that leaves x into component Gi has to return to x (from the same component, because x is a division point) before visiting another component. This notion that the Eulerian cycles inside components are "independent" of what occurs in other components will be a recurrent argument.

Because all Eulerian cycles have to cross x, division points are natural points at which to try to reduce the graph. However, as we have seen in the example, care has to be taken with the number of different incoming/outgoing edges from the same component:

Reduction Rule 2. Let G = (V, E) be an Eulerian graph and x ∈ V a division point that divides G in exactly two components G1, G2. If d̂_in^{G1}(x) = 1 and d̂_out^{G2}(x) = 1 (we denote these unique edges by (⟨v, x, ℓ⟩, p) and (⟨x, w, t⟩, k), respectively), we then define the reduced graph G′ = (V′, E′) such that:

• E′ = (E \ {(⟨v, x, ℓ⟩, 1), (⟨x, w, t⟩, 1)}) ∪ {(⟨v, w, ℓ ⋄ t⟩, 1)}

• V′ = V

Figure 5: General form of Rule 2: x is a division point, dividing the graph into two components, and has only one incoming edge from one component (from node v, labeled ℓ with multiplicity p) and one outgoing edge to the other component (to node w, labeled t with multiplicity k). Because components have to be crossed at least once, there has to be an edge ℓ ⋄ t.

See Fig. 5. Note that this rule never eliminates nodes, as there is at least one additional incoming and outgoing edge of x.
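As an illustration, the following sketch applies a single step of Rule 1 at a node x, reusing the hypothetical Edge record and overlap_concat from the sketch in Sect. 4; it is a sketch of the rule as stated above, not our actual implementation:

def apply_pigeonhole(incoming, outgoing, d, n):
    """incoming/outgoing: Edge lists of node x; d: the degree d(x)."""
    for e_in in incoming:
        for e_out in outgoing:
            a = e_in.multiplicity + e_out.multiplicity - d  # a > 0 iff p_i > d(x) - k_j
            if a <= 0:
                continue
            # a traversals are forced to pair e_in with e_out: merge them
            merged = Edge(e_in.tail, e_out.head,
                          overlap_concat(e_in.label, e_out.label, n - 1), a)
            e_in.multiplicity -= a   # caller drops edges whose multiplicity hits 0
            e_out.multiplicity -= a  # and decreases d(x) by a
            return merged
    return None

On the local context of Fig. 2 (α with multiplicity 8, δ with multiplicity 6, d(x) = 10), this merges α and δ into an edge labeled α ⋄ δ with multiplicity 4, leaving α with multiplicity 4 and δ with multiplicity 2.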
5.1 Correctness and Completeness
It remains to show that the reduction steps defined above satisfy Equation 1 (correctness) and that their successive application results in a graph satisfying Equation 2 (completeness). This subsection is technical in nature and its interest is mostly theoretical.
5.1.1 Correctness

We will prove that for both steps a single application results in an equivalent graph:
Rule 1. We divide this case depending on whether x is a division point or not. If x is not a division point, then an incoming path can take any outgoing path and still result in an Eulerian cycle. The argument we use is then an application of the pigeonhole principle: there are d(x) − pi incoming edges that do not have label ℓi and which – in principle – can take any of the outgoing edges. Even if they all leave x through edge tj, there would still be kj − (d(x) − pi) free tj edges, which have to be taken by some of the remaining ℓi edges. Therefore, in all Eulerian cycles, [ℓi, tj] appears at least kj − (d(x) − pi) times, which we model explicitly by creating that edge with the corresponding multiplicity. Suppose now that x is a division point. Rule 1 cannot be applied on edges from different components, because pi ≤ d(x) − kj whenever tj and ℓi belong to different components. But inside the same component the same principle applies as before: any incoming edge can be followed by any outgoing edge.
Rule 2. This rule gets applied if there are exactly two components G1, G2, such that to go from G1 to G2 (suppose, without loss of generality) x has to be used. The only way of reaching x from G1 is through v, and the only way of leaving x into G2 is through w. Therefore, in all Eulerian circuits [ℓ, t] appears at least once, and we only merge these two edges into one.
5.1.2 Completeness
We now suppose that G∗ is such that neither rule can be applied, and we will prove that G∗ satisfies Eq. 2. Note that because each step decreases the total number of edges (counted as the total sum of multiplicities), successive applications of both rules eventually lead to a graph where no rule can be applied. The way we prove this is to show that every node of G∗ is ambiguous, in the sense that an incoming edge does not determine unambiguously – for all Eulerian cycles – which outgoing edge to take. We proceed by cases, depending on whether a node x is a division point and, if it is, into how many components it divides the graph.

x is not a division point.
In particular, this means that after leaving x through any edge, an Eulerian cycle can re-visit x through any of the incoming edges (if not, it would be possible to divide the graph at x). Furthermore – as for all other nodes of G∗ – it holds that multiplicity(ei) ≤ d(x) − multiplicity(ej) for all incoming edges ei and outgoing edges ej. Therefore no single incoming edge ei can exhaust all but one outgoing edge, and thereby force any future entrance through ei to leave through this single outgoing edge. Because any combination of incoming and outgoing edges can result in a valid Eulerian cycle, this concludes this case.
x is a division point dividing G into exactly two components.
Let G1, G2 be these two components. Because the precondition of Rule 2 does not hold, d̂_in^{G1} > 1 or d̂_out^{G2} > 1, and additionally pi ≤ d(x) − kj for all incoming edges i and outgoing edges j. But now the same argument applies as in the previous paragraph, as G1, G2 are Eulerian. For the crossing point, when the path leaves G1 to enter G2, suppose that d̂_out^{G2} > 1 (if not, then d̂_in^{G1} > 1 and the argument is symmetric). This means that there are at least two possibilities of continuing into G2 from x. Now, any of these possibilities may result in a valid Eulerian cycle, because in order to move out of G2 the cycle will have to pass through x again.
x is a division point dividing G into more than two components.
Let these components be G1, G2, . . . , Gn. Suppose that [ℓ1, ℓ2] appears in all Eulerian cycles, where ℓ1 labels an incoming edge e1 of x and ℓ2 an outgoing edge e2. If e1 ∈ E_{i1}, then e2 cannot also be in E_{i1}, due to arguments similar to those above and because the precondition of Rule 1 does not hold. Suppose therefore that it belongs to E_{i2} ≠ E_{i1}. But because the Eulerian circuits of the Gi's are independent, we can simply construct a different Eulerian circuit that enters G_{i3} ≠ G_{i1}, G_{i2} after using e1 and before using e2. We then have:

Theorem 2. Applying Rules 1 and 2 successively on a de Bruijn graph G results in a graph G∗ satisfying the conditions of Equations 1 and 2.

A final remark: a graph G∗ satisfying Equations 1 and 2 is not necessarily unique. That is because the definition of irreducibility treats labels as atomic symbols. Consider for instance a graph G∗ where, in all label sequences ℓ(c) of Eulerian cycles, label ℓ1 is always preceded by either ℓ2 or ℓ3, and both of these end with the string α (with |α| ≥ n). In that case α ⋄ ℓ1 certainly appears in any plausible reconstruction, but it is not captured by the edges of G∗. In our experiments, changing the order in which the rules are applied did change the final irreducible graph, although only slightly. Another piece of information we did not take advantage of is the order of the blocks given by G∗.
5.2 Running Time
Rule 1 is applied at most once per edge, and each application can be done in constant time supposing that we store incoming and outgoing edges at each node. The most important aspect for Rule 2 is to detect division points, and the number of components they generate. Tarjan’s original algorithm [16] can be adapted for this and runs in linear time.
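For illustration only, a naive quadratic check for division points can be written as follows; the graph encoding follows the first sketch (node → list of successors), and in practice the linear-time adaptation of Tarjan's algorithm [16] mentioned above should be used instead:

from collections import deque

def is_division_point(graph, x):
    """x divides G if it carries a self-loop or if removing it leaves the
    remaining nodes in more than one (weakly) connected component."""
    nodes, adj = set(), {}
    for u, vs in graph.items():
        for v in vs:
            if u == x and v == x:
                return True        # self-loop over x divides G by itself
            nodes.update(w for w in (u, v) if w != x)
            if u != x and v != x:  # undirected adjacency, ignoring x
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:                   # BFS over the graph with x removed
        for w in adj.get(queue.popleft(), ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) < len(nodes)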
6. EXPERIMENTS
We performed experiments on English out-of-copyright books from the Gutenberg Project¹. For all experiments, the "April 2010" release (containing in total more than 29,500 books) was used, keeping only those books labeled as English. All documents were segmented using white spaces as segmentation points, and n-grams were extracted considering words as atomic symbols. No further pre-treatment was done.

A first surprise was how few times Rule 2 (using the division points) was applied relative to Rule 1, making up only between 0.01% and 0.05% of the total applications. Because it is at the same time the computationally more expensive rule (due to the pre-calculation of division points), in the following experiments we neglected that rule and only used Rule 1.

¹ www.gutenberg.org
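The pre-processing just described amounts to a few lines; a minimal sketch (function name ours):

from collections import Counter

def extract_ngrams(text, n):
    """White-space tokenization, words as atomic symbols."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = extract_ngrams("$ a rose rose is a rose is a rose #", 2)
print(counts[("a", "rose")])   # -> 3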
In a first instance, we considered the books independently and ran our reductions on each one separately. For these experiments we sampled randomly 1,000 books from the Gutenberg Project (the same ones for each value of n). Fig. 6a shows a boxplot of the average length of the maximal blocks obtained for each book, and in Fig. 6b we show the same for maximal sizes. For instance, using n = 5, the final maximal blocks have a mean average length of 54.44, corresponding to roughly 5 lines of text of the current paper. The mean maximal length is 658.34, almost the number of words on the first page of this paper. Note the logarithmic scale: for n = 3, the mean average length of blocks is 16.26, while for n = 6 it jumps up to 164.18. The books have different lengths, and the longer the book the more variation there may be and the harder it is to find maximal blocks. We plot these differences in Fig. 7, where we now consider the number of blocks (not their length). For large values of n (such as 10 in the figure), several books can almost be totally reconstructed with fewer than 10 final blocks. In many of these cases, all blocks but one correspond to substrings of the license agreement boilerplate, which is repeated inside the document. When this boilerplate was removed, many more books were reconstructed as one single block.

In a second instance we studied n-gram corpora extracted from more than one book. Because more variation is included, it is in general a harder problem to reconstruct whole collections of books. We computed again the average and maximal size of blocks and show them in Fig. 8 with respect to different numbers of books, with n fixed to 5. Each datapoint at x corresponds to the concatenation (separated by unique markers) of x randomly selected books, and in each case we show the average over 100 different random selections. Note that the average block size decreases only slowly, showing that there are many more smaller blocks. However, the maximal size of retrieved substrings keeps increasing, reaching almost 12,000 for 1,000 books. Fig. 8b shows a more detailed view, with a boxplot of the number of blocks of 100 or more words. This number also keeps increasing with the number of books. The threshold of 100 was chosen arbitrarily, corresponding to a non-trivial block of original text. Remember that whatever block is reconstructed occurs with absolute certainty in the original document.
(a) Mean size of average blocks. (b) Mean size of maximal blocks.
Figure 6: Mean maximal and average block size, using data from single books.

Figure 7: Number of blocks as a function of document size.

(a) Mean size of average (to the left, with blue crosses) and maximal (to the right, with red circles) blocks. (b) Boxplot of the number of blocks with 100 or more words.
Figure 8: Impact of the number of books on the reconstruction algorithm. n = 5.

7. NOISING

Of course, having complete information is not a realistic assumption. The Google Ngram Corpus, for instance, does not give information on n-grams that occur fewer than 40 times in the whole corpus. This is the most popular choice, as the provider still gives away correct information (no noise is introduced) and the infrequent n-grams are arguably less interesting for most use cases. We imitated this behavior on our Gutenberg corpus. For this experiment, we fixed the number of books to 100, and n = 5. In Fig. 9 we repeat the same plots as before, but now indexing with respect to the minimal number of occurrences (M in the figure) that an n-gram has to have to be kept in the corpus. In general, the number of blocks decreases because the graph becomes much sparser. However, the long blocks are maintained, which is reflected in the general increase of the average block size (blue line), although there is a small drop at M = 15. Note, however, that the absolute variation is small (between 8.5 and 11). More interestingly, the maximal block size changes little once all n-grams occurring only once are removed (red line).

However, the returned blocks may no longer be actual substrings of the original corpus. By removing some edges, the assumptions for the correctness of the rules are not satisfied any more. We therefore measured whether the returned edges are indeed correct (Fig. 9b), for several groups of edges depending on their lengths. Most long edges, of which there was still a considerable number², were correct. Even when considering all possible edges, in general more than 95% of the returned blocks corresponded to a true substring of one of the original documents.

² for instance, an average of 49 for edges greater or equal than 50 words (red line)

We will now study another method that adds noise to an n-gram corpus and which is specifically targeted at preventing reconstruction. However, that goal is in opposition to the goal of providing a useful corpus: it is easy to add noise to a corpus if the utility can be neglected. In order to ground the notion of utility of the corpus, we will measure it as the perplexity of a language model deduced from it, because such a model is a building block for many applications (most notably machine translation and speech recognition).
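For reference, removing low-frequency n-grams amounts to a one-line filter over the count table; a minimal sketch, with M as in Fig. 9:

def filter_low_frequency(counts, M):
    """Keep only n-grams occurring more than M times (the Google Ngram
    Corpus, for instance, uses a threshold of 40)."""
    return {g: k for g, k in counts.items() if k > M}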
7.1 Method
Instead of removing n-grams, we therefore propose another method, which consists in adding fictitious n-grams. Our method specifically targets the application of the Pigeonhole rule. While this may seem very constrained, we believe that the success of the reconstruction method based upon that step points to some crucial structural properties of the de Bruijn graph. We will call each node x such that

δ(x) = max_{e∈E : head(e)=x} max_{f∈E : tail(f)=x} multiplicity(e) + multiplicity(f) − d(x) > 0

(see again Fig. 3) an irregular node. It is on irregular nodes that the Pigeonhole rule gets applied. The method we propose does two things: on one hand it polishes irregular nodes to become regular, and on the other hand it disturbs a random set of nodes, creating false irregular nodes.
7.1.1 Polishing
The detailed procedure is given in Alg. 1. That procedure adds 2K edges to the graph, converting at most K nodes into regular ones. For this, it picks K random irregular nodes, and for each selected node it looks for two compatible nodes: u is compatible with v if they overlap in a substring of size n − 2 (written c(u, v) = u[|u| − (n − 2) + 1 : |u|] == v[1 : n − 2]). In order to avoid adding false n-grams with too large a multiplicity, the added multiplicity is thresholded by a second parameter δmax. Finally, note that the given algorithm (which is also the one we used in our experiments) breaks the Eulerian property of the graph (namely, that all nodes are balanced). This could be avoided by grouping all K nodes by their δ(x) and creating Eulerian cycles for each group. Doing so would hide the fact that the corpus was modified, as there would then exist a plausible reconstruction corresponding to such a de Bruijn graph.

Algorithm 1 Polishing of irregular nodes
polish(K, δmax)
1: for K times do
2:   x = pick a random irregular node
3:   δ = min(δ(x), δmax)
4:   u, v = pick two random compatible nodes (with c(u, x) and c(x, v))
5:   add edges (⟨u, x, u ⋄ x⟩, δ) and (⟨x, v, x ⋄ v⟩, δ)
6: end for

(a) Mean size of average (to the left, with blue crosses) and maximal (to the right, with red circles) blocks, with respect to M, the threshold for filtering n-grams. (b) Percentage of correct blocks of different lengths with respect to M, the threshold for filtering n-grams.
Figure 9: Running with incomplete data, where all n-grams occurring less or equal than M times are removed. Each datapoint is an average over 10 samples of 100 randomly concatenated books. In all cases we used 5-grams.
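A sketch of Alg. 1 in Python, reusing the hypothetical Edge record and overlap_concat from the earlier sketches; δ(x) and the compatibility test c(u, v) are implemented as stated in the text, and all names are ours:

import random

def delta(edges, x):
    """delta(x) from the text: max over incoming e and outgoing f of
    multiplicity(e) + multiplicity(f) - d(x)."""
    inc = [e.multiplicity for e in edges if e.head == x]
    out = [f.multiplicity for f in edges if f.tail == x]
    if not inc or not out:
        return 0
    return max(inc) + max(out) - sum(inc)  # d(x) = sum of incoming multiplicities

def compatible(u, v, n):
    """c(u, v): the (n-1)-gram nodes overlap in a substring of size n-2."""
    return u[len(u) - (n - 2):] == v[:n - 2]

def polish(nodes, edges, K, delta_max, n):
    # Alg. 1: convert up to K irregular nodes into regular ones by adding
    # 2K fictitious n-grams (assumes compatible nodes exist for each pick)
    for _ in range(K):
        irregular = [v for v in nodes if delta(edges, v) > 0]
        if not irregular:
            break
        x = random.choice(irregular)
        d = min(delta(edges, x), delta_max)
        u = random.choice([v for v in nodes if compatible(v, x, n)])
        w = random.choice([v for v in nodes if compatible(x, v, n)])
        edges.append(Edge(u, x, overlap_concat(u, x, n - 2), d))
        edges.append(Edge(x, w, overlap_concat(x, w, n - 2), d))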
7.1.2 Disturbing
Besides removing irregularities in the graph, another way of tricking the Pigeonhole rule is to create false irregularities. To do so, we propose to add edges so that δ(x) becomes positive. This is done by adding an edge with multiplicity d(x) + a, where a is positive. In order to fix the value of a, we use an exponential distribution.

Algorithm 2 Creating irregular nodes
disturb(K, λ)
1: for K times do
2:   x = pick a random node
3:   m ∼ exp(λ)
4:   m = d(x) + ⌊m⌋ + 1
5:   u, v = pick two random compatible nodes (with c(u, x) and c(x, v))
6:   add edges (⟨u, x, u ⋄ x⟩, m) and (⟨x, v, x ⋄ v⟩, m)
7: end for
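Alg. 2 can be sketched in the same style, reusing delta, compatible, Edge and overlap_concat from the previous sketch:

def disturb(nodes, edges, K, lam, n):
    # Alg. 2: create false irregular nodes by adding edges whose multiplicity
    # exceeds d(x), with the excess drawn from an exponential distribution
    for _ in range(K):
        x = random.choice(nodes)
        excess = int(random.expovariate(lam))  # floor(m) with m ~ exp(lambda)
        m = sum(e.multiplicity for e in edges if e.head == x) + excess + 1
        u = random.choice([v for v in nodes if compatible(v, x, n)])
        w = random.choice([v for v in nodes if compatible(x, v, n)])
        edges.append(Edge(u, x, overlap_concat(u, x, n - 2), m))
        edges.append(Edge(x, w, overlap_concat(x, w, n - 2), m))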
7.2 Evaluation
In order to compare both noising approaches, we consider the number of modified edges versus the error rate of running the reconstruction algorithm on the noised corpus. The blue crosses of Fig. 10 correspond to the same experiment as in Fig. 9b, where the x-axis now corresponds to the number of edges (n-grams) that were removed. Against that we compared the application of both algorithms of the previous subsection. For simplicity we fixed K to the same value in both algorithms, resulting in the addition of a total of 4K edges; δmax was set to 20 and λ to 0.5. The error rate of that method is given by the red circles in Fig. 10. With a much lower number of modified edges a much higher error rate is achieved, reaching 20% with K = 1,000,000.

Of course, hindering reconstruction alone is easy. If such a data-set is released in the first place, it is assumed to be useful in some sense. In order to ground such a utility in a measurable way, we assume that the goal is to construct a language model out of the collected n-grams, a goal that covers many different applications. We therefore evaluate the perplexity of a language model created from the noised corpora (either through removing or adding edges), using Good-Turing discounting³.

³ We used the CMU-Cambridge Statistical Language Modeling Toolkit v2: http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
Figure 10: Accuracy of the Pigeonhole rule with respect to removing edges (blue crosses) and the method described here (red circles).
For testing, we took an additional 100 random books (not contained in the ones used to create the n-gram corpus) and report average perplexity. From Fig. 11 it can be seen that the perplexity deteriorates quickly when removing less frequent n-grams. When adding edges, the deterioration is not only less acute but also more gradual, making it easier to control. Note that the baseline perplexity – corresponding to M = 0 – is 7.55, only slightly lower than the one obtained by adding edges.
8. CONCLUSIONS
We analyzed the possibility of reconstructing documents when the only available information is the counts of their n-grams (with n fixed). This reduces to the problem of finding an Eulerian cycle in the de Bruijn graph constructed from these n-grams. Any of these cycles is a plausible reconstruction of the document; however, only one of them is the correct one. We proposed a method that merges edges into bigger blocks and whose successive application results in an equivalent and irreducible graph. The method consists of two rules: the first is based on the local information provided by single nodes, while the second uses the global structure of the graph through division points, an extended definition of articulation points (even if this last rule does not get applied often in practice).

Experiments on real data show the capacity of reconstructing whole books when n is sufficiently big (typically 10). Even with a more reasonable value of n, whole pages can be reconstructed. Theorem 2 ensures that these reconstructed sections are correct, that is, that they appear with certainty, given that the original data is complete and correct.

We also studied methods to prevent such a reconstruction, under the assumption that the data provider has two opposing goals: (1) maximize the utility of the corpus (measured here as the perplexity of the obtained language model), while (2) preventing reconstruction of substrings longer than those provided. Under these hypotheses, we showed that the technique of removing low-frequency n-grams fails on both goals. Instead, we proposed another method that adds fictitious n-grams in a strategic and non-deterministic way, and showed that it performs better at preventing reconstruction while deteriorating only little with respect to a language model obtained from unperturbed data.

(a) Perplexity versus removing all n-grams occurring fewer or equal than M times. (b) Perplexity versus running Alg. 1 and 2 with parameter K.
Figure 11: Influence on the quality of language modeling when perturbing original n-gram counts. Here, n = 5 and the starting corpus corresponds to 100 randomly selected books.

Running time has not been an issue in our experiments, as all concerned algorithms run in linear time and the most frequently applied rule (Rule 1) consists only of comparisons of edge multiplicities. Memory, however, is a bottleneck when running on standard machines as soon as the word count of all the books exceeds 100 million. Our implementation has not been optimized for memory usage, but this is a well-studied problem in genome assembly (using, for instance, Bloom filters to hash the n-grams [4]), although as mentioned the problem instances are different. Distributed graph libraries⁴ also exist, where running time could be traded off in order to scale up to bigger graphs, taking advantage of the fact that the algorithm mostly uses local information.

So far we have not considered the content of each n-gram in the decision of when to merge two edges. A heuristic, based for instance on a language model of (n − 1)-grams, or on readability criteria, could be an effective guide that obtains longer final blocks, at the cost of including some more erroneous blocks.

⁴ for example https://giraph.apache.org/
9. REFERENCES
[1] Nicola Cancedda. Private access to phrase tables for statistical machine translation. In ACL, pages 23–27, 2012.
[2] Vikas Chauhan and Ari Trachtenberg. Reconciliation puzzles [separately hosted strings reconciliation]. In Global Telecommunications Conference (GLOBECOM '04), volume 2, pages 600–604. IEEE, 2004.
[3] Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson, and Paul Medvedev. On the representation of de Bruijn graphs. arXiv:1401.5383, 2014.
[4] Rayan Chikhi and Guillaume Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology, 8:22, 2013.
[5] Phillip E. C. Compeau, Pavel A. Pevzner, and Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11):987–991, November 2011.
[6] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
[7] N. G. de Bruijn. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen, 49:758–764, 1946.
[8] Nathanael Fillmore, Andrew B. Goldberg, and Xiaojin Zhu. Document recovery from bag-of-word indices. Technical report, University of Wisconsin–Madison, 2008.
[9] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[10] Piyush Kansal and Himanshu Jindal. Reconstructing books using Google n-grams. Master's thesis, Stony Brook University, 2011.
[11] Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. Syntactic annotations for the Google Books Ngram corpus. In ACL (System Demonstrations), pages 169–174, 2012.
[12] Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
[13] Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, June 2010.
[14] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS, 98(17):9748–9753, August 2001.
[15] Pavel A. Pevzner. l-Tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure and Dynamics, 1989.
[16] Robert Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, June 1972.
[17] Xiaojin Zhu, Andrew B. Goldberg, Michael Rabbat, and Robert D. Nowak. Learning bigrams from unigrams. In ACL, pages 656–664, 2008.