Computing Entropy Rate Of Symbol Sources & A Distribution-free Limit ...

1 downloads 0 Views 521KB Size Report
Mar 21, 2014 - compute the entropy rate; they are indeed full-scale data compression utilities, that produce a decodable representation of the input. Can we.
1

Computing Entropy Rate Of Symbol Sources & A Distribution-free Limit Theorem

arXiv:1401.0711v2 [cs.IT] 21 Mar 2014

Ishanu Chattopadhyay [email protected]

Hod Lipson [email protected]

Abstract—Entropy rate of sequential data-streams naturally quantifies the complexity of the generative process. Thus entropy rate fluctuations could be used as a tool to recognize dynamical perturbations in signal sources, and could potentially be carried out without explicit background noise characterization. However, state of the art algorithms to estimate the entropy rate have markedly slow convergence; making such entropic approaches non-viable in practice. We present here a fundamentally new approach to estimate entropy rates, which is demonstrated to converge significantly faster in terms of input data lengths, and is shown to be effective in diverse applications ranging from the estimation of the entropy rate of English texts to the estimation of complexity of chaotic dynamical systems. Additionally, the convergence rate of entropy estimates do not follow from any standard limit theorem, and reported algorithms fail to provide any confidence bounds on the computed values. Exploiting a connection to the theory of p probabilistic automata, we establish a convergence rate of O(log |s|/ 3 |s|) as a function of the input length |s|, which then yields explicit uncertainty estimates, as well as required data lengths to satisfy pre-specified confidence bounds. Index Terms—Entropy rate, Stochastic processes, Probabilistic automata, Symbolic dynamics

I. Motivation, Background & Contribution The entropy rate of a stationary and ergodic process converges in probability to the per-letter Kolmogorov complexity of a single sufficiently long sample path [1]. While Kolmogorov complexity is incomputable, entropy rates can, in principle, be estimated. Ability to quantify the complexity of a signal source, even in the average sense, can provide valuable insights into the driving dynamics; and can potentially be used as a tool to detect dynamical anomalies without explicit knowledge of background noise processes. However, source entropy rate estimation from an observed sample path is computationally non-trivial. Even with the assumptions of ergodicity and stationarity, one cannot fruitfully apply the defining relation in Eq.(6) due to the exponential increase in the number of different words with the word-length. This is particularly important if there are long-range dependencies in the symbol stream. Such dependencies introduce additional long-range structure; decreasing the source entropy in the process. In such cases unacceptably long words or blocks must be considered, and pre-mature truncation of the computation would lead to large errors. The best known algorithms that carry out a more efficient computation are based on Lempel-Ziv (LZ) source coding [2], [3], [4]. The LZ coding algorithms are asymptotically optimal, i.e. their compression rate approaches the source entropy rate for any ergodic stationary stochastic process. The key idea here is adaptive dictionary compression: parse the input string into distinct phrases, and represent them with codewords, making sure that short codewords are assigned to common phrases. Done optimally, one ends up with a compressed string, such that the ratio of the input and output lengths approach the source entropy rate. Different variations on this idea have been reported [5], [6]. Techniques distinct from LZ parsing are also known, e.g., Rissanen [7] reported a universal compression scheme, which instead of gathering parsed segments of the input along with their occurrence counts, collects the “contexts” in which each symbol of the input string occurs, together with conditional occurrence counts. Importantly, a majority of the reported techniques do more than just compute the entropy rate; they are indeed full-scale data compression utilities, that produce a decodable representation of the input. Can we do better if we are only interested in the former? This paper provides an affirmative answer to this possibility. Secondly, existing techniques lack convergence rate estimates; computation of error bars for reported approaches do not follow from any standard limit theorem. There is indeed no analytical

Stationary ergodic stochastic process

Hidden Process

States are hidden 01010010101011100101010101010   • Entropy rate? • Convergence Rate of estimate?

Problem description. Given a quantized data stream, how do we compute the entropy rate of the hidden process? Even with the assumption of stationarity and ergodicity for the generator, reported algorithms converge very slowly. Additionally, these convergence rates are unknown for such approaches; implying that we cannot put uncertainty bounds on the computed values in practice. We show that a significantly faster computation of the entropy rate is possible; and derive a universal lower bound on how slowly this convergence might occur.

Fig. 1.

way to check for the internal consistency of the estimation or its accuracy. We may observe gradual convergence to a limiting value, and this is indeed guaranteed by theory; but are unable to provide uncertainty bounds on the computed estimate with finite inputs. Typically observed slow convergence in all non-trivial scenarios, for all reported algorithms, makes this a key issue. An empirical relationship, without proof or theoretical backing, has been suggested [8], which conjectures the |s|-dependence (|s| being the e to follow length of the input s) of the estimated entropy rate H |s| e ' Hactual +c log γ , where c , γ are fit parameters. In this paper, we H |s| show that,p at least with our algorithm, the convergence rate is given by O(log |s|/ 3 |s|). This is a distribution-free result, in the sense that the asymptotic bound does not depend on the source characteristics. In consequence, we can derive explicit uncertainty estimates at specified confidence bounds on the estimated entropy rate for finite-length input data. A. Key Insight Our approach is based on modeling discrete and finite-valued stationary and ergodic sources as probabilistic automata. Our automata is distinct from that of Paz [9], and each model in our case is in fact an encoding of a measure defined on the space of strictly infinite strings over a finite alphabet. While the formalisms are completely different, some aspects of this approach has subtle parallels to that of Rissanen’s “context algorithm” [7]; his search for contexts which yield similar probabilities of generating future symbols is analogous to our search for a synchronizing string in the input stream - a finite sequence of symbols that, once executed on a probabilistic automaton, leads to a fixed state irrespective of the initial conditions. Of course we do not know anything about the hidden model a priori; but nevertheless we establish that such a string, at least in a well-defined approximate sense, always exists and is identifiable efficiently. Finally, we show that, given such an approximate synchronizing string, we can use results from non-parametric statistics to bound the probability of error as a function of the input length. B. Entropy & Entropy Rate Entropy H(X) of a discrete random variable X, taking values in the alphabet Σ, is defined as: X H(X) = − p(x) log p(x) (1) x∈Σ

2

where p(x) is the probability of occurrence of x ∈ Σ. The base of the logarithm is generally taken to be 2, and then the entropy is being expressed in bits. While the definition of entropy of a random variable may be obtained axiomatically, a perhaps more compelling approach is to show that it arises as the average length of the shortest description of a random variable [10]. The joint entropy of a set of random variables X1 , · · · , Xn , with Xi taking values in the alphabet X Σi , is defined in the usual manner: H(X1 , · · · , Xn ) = − p(x1 , · · · , xn ) log2 p(x1 , · · · , xn ) (2) xi ∈Σi

The chain rule for entropy calculations [10] follows from the definitions, and is of particular importance: H(X1 , · · · , Xn ) =

n X

H(Xi |Xi−1 , · · · , X1 )

(3)

i=1

The notion of entropy formalizes the Asymptotic Equipartion Property (AEP): If discrete random variables X1 , · · · , Xn are i.i.d. and have probability mass function p(x), then we have: X 1 a.s p(x) log p(x) (4) − log p(X1 , · · · , Xn ) −−→ H(X) = − n

x∈Σ

The AEP implies that nH(X) bits suffice on average to describe n i.i.d. random variables. If the random variables are not independent, the entropy H(X1 , · · · , Xn ) still grows asymptotically linearly with n at a rate known as the entropy rate of the process. In particular, if the random variables define a stationary ergodic stochastic process X = {Xi }, then the AEP still holds: 1 a.s. (5) − log p(X1 , · · · , Xn ) −→ H(X) n where H(X) is the entropy rate of the process defined as: 1 H(X) = lim H(X1 , · · · , Xn ) (6) n→∞ n As in the case of the i.i.d variables, typical sequences of length n may be represented using approximately nH(X) bits. Thus the entropy rate quantifies the average description length of the process, and hence its expected complexity [1].

II. Stochastic Processes & Probabilistic Automata As mentioned earlier, our approach hinges upon effectively using probabilistic automata to model stationary, ergodic processes. Our automata models are distinct to those reported in the literature [9], [11]. The details of this formalism can be found in [12]; we include a brief overview here for the sake of completeness. Notation 1. Σ denotes a finite alphabet of symbols. The set of all finite but possibly unbounded strings on Σ is denoted by Σ? [13]. The set of finite strings over Σ form a concatenative monoid, with the empty word λ as identity. The set of strictly infinite strings on Σ is denoted as Σω , where ω denotes the first transfinite cardinal. For a string x, |x| denotes its length, and for a set A, |A| denotes its cardinality. Also, Σd+ = {x ∈ Σ? s.t. |x| 5 d}. Definition 1 (QSP). A QSP H is a discrete time Σ-valued strictly stationary, ergodic stochastic process, i.e. H = {Xt : Xt is a Σ-valued random variable, t ∈ N ∪ {0}} (7) A process is ergodic if moments may be calculated from a sufficiently long realization, and strictly stationary if moments are time-invariant. We next formalize the connection of QSPs to PFSA generators. We develop the theory assuming multiple realizations of the QSP H, and fixed initial conditions. Using ergodicity, we will be then able to apply our construction to a single sufficiently long realization, where initial conditions cease to matter. Definition 2 (σ-Algebra On Infinite Strings). For the set of infinite strings on Σ, we define B to be the smallest σ-algebra generated by the family of sets {xΣω : x ∈ Σ? }. Lemma 1. Every QSP induces a probability space (Σω , B, µ). Proof: Assuming stationarity, we can construct a probability measure µ : B → [0, 1] by defining for any sequence x ∈ Σ? \ {λ}, and

a sufficiently large number of realizations NR (assuming ergodicity): # of initial occurrences of x # of initial occurrences of all sequences of length |x| and extending the measure P to elements of B \B via at most countable sums. Thus µ(Σω ) = x∈Σ? µ(xΣω ) = 1, and for the null word µ(λΣω ) = µ(Σω ) = 1.

µ(xΣω ) = lim

NR →∞

Notation 2. For notational brevity, we denote µ(xΣω ) as Pr(x). Classically, automaton states are equivalence classes for the Nerode relation; two strings are equivalent if and only if any finite extension of the strings is either both in the language under consideration, or neither are [13]. We use a probabilistic extension [14]. Definition 3 (Probabilistic Nerode Equivalence Relation). (Σω , B, µ) induces an equivalence relation ∼N onthe set of finite strings Σ? as: ∀x, y ∈ Σ? , x ∼N y ⇐⇒ ∀z ∈ Σ? Pr(xz) = Pr(yz) = 0  _ Pr(xz)/Pr(x) − Pr(yz)/Pr(y) = 0

(8)

Notation 3. For x ∈ Σ? , the equivalence class of x is [x]. It is easy to see that ∼N is right invariant, i.e. (9) x ∼N y ⇒ ∀z ∈ Σ? , xz ∼N yz A right-invariant equivalence on Σ? always induces an automaton structure; and hence the probabilistic Nerode relation induces a probabilistic automaton: states are equivalence classes of ∼N , and the transition structure arises as follows: For states qi , qj , and x ∈ Σ? , σ

([x] = q) ∧ ([xσ] = q 0 ) ⇒ q − → q0

(10)

Before formalizing the above construction, we introduce the notion of probabilistic automata with initial, but no final, states. Definition 4 (Initial-Marked PFSA). An initial marked probabilistic finite state automaton (a Initial-Marked PFSA) is a quintuple e, q0 ), where Q is a finite state set, Σ is the alphabet, (Q, Σ, δ, π e : Q × Σ → [0, 1] δ : Q × Σ → Q is the state transition function, π specifies the conditional symbol-generation probabilities, and q0 ∈ Q e are recursively extended to arbitrary is the initial state. δ and π y = σx ∈ Σ? as follows: ∀q ∈ Q, δ(q, λ) = q (11) δ(q, σx) = δ(δ(q, σ), x) (12) e(q, λ) = 1 ∀q ∈ Q, π (13) e(q, σx) = π e(q, σ)e π π(δ(q, σ), x) (14) Additionally, we impose that for distinct states qi , qj ∈ Q, there e(qi , x) > 0. exists a string x ∈ Σ? , such that δ(qi , x) = qj , and π Note that the probability of the null word is unity from each state. If the current state and the next symbol is specified, our next state is fixed; similar to Probabilistic Deterministic Automata [15]. However, unlike the latter, we lack final states in the model. Additionally, we assume our graphs to be strongly connected. Later we will remove initial state dependence using ergodicity. Next we formalize how a PFSA arises from a QSP. Lemma 2 (PFSA Generator). Every Initial-Marked PFSA G = e, q0 ) induces a unique probability measure µG on the (Q, Σ, δ, π measurable space (Σω , B). Proof: Define set function µG on the measurable space (Σω , B): µG (∅) , 0 (15) e(q0 , x) ∀x ∈ Σ? , µG (xΣω ) , π (16) ∀x, y ∈ Σ? , µG ({x, y}Σω ) , µG (xΣω ) + µG (yΣω ) (17) Countable additivity of µG is immediate, and (See Definition 4): e(q0 , λ) = 1 µG (Σω ) = µG (λΣω ) = π (18) implying that (Σω , B, µG ) is a probability space. We refer to (Σω , B, µG ) as the probability space generated by the Initial-Marked PFSA G. Lemma 3 (Probability Space To PFSA). If the probabilistic Nerode relation corresponding to a probability space (Σω , B, µ) has a finite index, then the latter has an initial-marked PFSA generator.

3

Proof: Let Q be the set of equivalence classes of the probabilistic Nerode relation (Definition 3), and define functions δ : Q × Σ → Q, e : Q × Σ → [0, 1] as: π δ([x], σ) = [xσ] (19) Pr(x 0 σ) for any choice of x 0 ∈ [x] (20) Pr(x 0 ) ? e recursively to y = σx ∈ Σ as where we extend δ, π δ(q, σx) = δ(δ(q, σ), x) (21) e(q, σx) = π e(q, σ)e π π(δ(q, σ), x) (22) For verifying the null-word probability, choose a x ∈ Σ? such that [x] = q for some q ∈ Q. Then, from Eq. (20), we have: Pr(x 0 λ) Pr(x 0 ) e(q, λ) = e(q, λ) = π for any x 0 ∈ [x] ⇒ π = 1 (23) Pr(x 0 ) Pr(x 0 ) Finite index of ∼N implies |Q| < ∞, and hence denoting [λ] as q0 , we e, q0 ) is an Initial-Marked PFSA. Lemma 2 conclude: G = (Q, Σ, δ, π implies that G generates (Σω , B, µ), which completes the proof. e([x], σ) = π

The above construction yields a minimal realization for the InitialMarked PFSA, unique up to state renaming. Lemma 4 (QSP to PFSA). Any QSP with a finite index Nerode equivalence is generated by an Initial-Marked PFSA. Proof: Follows immediately from Lemma 1 (QSP to Probability Space) and Lemma 3 (Probability Space to PFSA generator). A. Canonical Representations We have defined a QSP as both ergodic and stationary, whereas the Initial-Marked PFSAs have a designated initial state. Next we introduce canonical representations to remove initial-state dependence. We e to denote the matrix representation of π e ij = π e, i.e., Π e(qi , σj ), use Π qi ∈ Q, σj ∈ Σ. We need the notion of transformation matrices Γσ . Definition 5 (Transformation Matrices). For an initial-marked PFSA e, q0 ), the symbol-specific transformation matrices Γσ ∈ G = (Q, Σ, δ, π {0, 1}|Q|×|Q| are:  e(qi , σ), if δ(qi , σ) = qj π Γσ ij = (24) 0, otherwise Transformation matrices have a single non-zero entry per row, reflecting our generation rule that given a state and a generated symbol, the next state is fixed. First, we note that, given an initial-marked PFSA G, we can associate a probability distribution ℘x over the states of G for each x ∈ Σ? in the following sense: if x = σr1 · · · σrm ∈ Σ? , then we have: m Y 1 Qm ℘x = ℘σr1 ···σrm = Γ σrj (25) ℘λ ||℘λ |

j=1 Γσrj ||1

{z

}

j=1

Normalizing factor

where ℘λ is the stationary distribution over the states of G. Note that there may exist more than one string that leads to a distribution ℘x , beginning from the stationary distribution ℘λ . Thus, ℘x is an equivalence class of strings, i.e., x is not unique. Definition 6 (Canonical Representation). An initial-marked PFSA e, q0 ) uniquely induces a canonical representation G = (Q, Σ, δ, π eC ), where QC is a subset of the set of probability (QC , Σ, δC , π eC : QC × Σ → [0, 1] distributions over Q, and δC : QC × Σ → QC , π

are constructed as follows: 1) Construct the stationary distribution on Q using the transition probabilities of the Markov Chain induced by G, and include this as the first element ℘λ of QC . Note that the transition |Q|×|Q| matrix for G , Pis the row-stochastic matrix M ∈ [0, 1] e(qi , σ), and hence ℘λ satisfies: with Mij = σ:δ(qi ,σ)=qj π ℘λ M = ℘λ (26) eC recursively: 2) Define δC and π 1 δC (℘x , σ) = ℘x Γσ , ℘xσ (27) ||℘x Γσ ||1

e eC (℘x , σ) = ℘x Π π

For a QSP H, the canonical representation is denoted as CH .

(28)

σ0 |0.85

σ1 |0.75

σ1 |0.15

q0

σ0 |0.85

q0

q1

σ0 |0.25 Synchronizable

σ0 |0.25

σ1 |0.15

q1

σ1 |0.75 Non-synchronizable

Synchronizable and non-synchronizable machines. Identifying contexts is a key step in estimating the entropy rate of stochastic signals sources; and for PFSA generators, this translates to a state-synchronization problem. However, not all PFSAs are synchronizable, e.g., while the top machine is synchronizable, the bottom one is not. Note that a history of just one symbol suffices to determine the current state in the synchronizable machine (top), while no finite history can do the same in the nonsynchronizable machine (bottom). However, we show that a -synchronizable string always exists (Theorem 1).

Fig. 2.

Lemma 5 (Properties of Canonical Representation). Given an initiale, q0 ): marked PFSA G = (Q, Σ, δ, π 1) The canonical representation is independent of the initial state. eC ) contains a copy 2) The canonical representation (QC , Σ, δC , π of G in the sense that there exists a set of states Q 0 ⊂ QC , such that there exists a one-to-one map ζ : Q → Q 0 , with: e(q, σ) = π eC (ζ(q), σ) π ∀q ∈ Q, ∀σ ∈ Σ, (29) δ(q, σ) = δC (ζ(q), σ) 3) If during the construction (beginning with ℘λ ) we encounter ℘x = ζ(q) for some x ∈ Σ? , q ∈ Q and any map ζ as defined in (2), then we stay within the graph of the copy of the initialmarked PFSA for all right extensions of x. Proof: (1) follows the ergodicity of QSPs, which makes ℘λ independent of the initial state in the initial-marked PFSA. (2) The canonical representation subsumes the initial-marked representation in the sense that the states of the latter may themselves be seen as degeneratedistributions over Q, i.e., by letting E = ei ∈ [0 1]|Q| , i = 1, · · · , |Q| (30) denote the set of distributions satisfying:  1, if i = j (31) ei |j = 0, otherwise (3) follows from the strong connectivity of G. Lemma 5 implies that initial states are unimportant; we may denote the initial-marked PFSA induced by a QSP H, with the initial marking removed, as PH , and refer to it simply as a “PFSA”. States in PH are representable as states in CH as elements of E. Next we show that we always encounter a state arbitrarily close to some element in E (See Eq. (30)) in the canonical construction starting from the stationary distribution ℘λ on the states of PH . Next we introduce the notion of -synchronization of probabilistic automata (See Figure 2), which would be of fundamental importance to our entropy estimation algorithm in the next section. Synchronization of automata is fixing or determining the current state; thus it is analogous to contexts in Rissanen’s “context algorithm” [7]. We show that not all PFSAs are synchronizable, but all are -synchronizable. Theorem 1 (-Synchronization of Probabilistic Automata). For any QSP H over Σ, the PFSA PH satisfies: ∀ 0 > 0, ∃x ∈ Σ? , ∃ϑ ∈ E, ||℘x − ϑ||∞ 5  0 (32) Proof: We show that all PFSA are at least approximately synchronizable [16], [17], which is not true for deterministic automata. If the graph of PH (i.e., the deterministic automaton obtained by removing the arc probabilities) is synchronizable, then Eq. (32) trivially holds true for  0 = 0 for any synchronizing string x. Thus, we assume the graph of PH to be non-synchronizable. From definition of non-synchronizability, it follows: ∀qi , qj ∈ Q, with qi , qj , ∀x ∈ Σ? , δ(qi , x) , δ(qj , x) (33) If the PFSA has a single state, then every string satisfies the condition in Eq. (32). Hence, we assume that the PFSA has more than one state. Now if we have: 0 Pr(x x) Pr(x 00 x) ∀x ∈ Σ? , = where [x 0 ] = qi , [x 00 ] = qj (34) 0 00 Pr(x )

Pr(x )

then, by the Definition 3 , we have a contradiction qi = qj . Hence ∃x0 such that Pr(x 0 x0 ) Pr(x 00 x0 ) , where [x 0 ] = qi , [x 00 ] = qj 0 Pr(x ) Pr(x 00 )

(35)

4

X Pr(x 0 x) = 1, for any x 0 where [x 0 ] = qi (36) Pr(x 0 ) x∈Σ? we conclude without loss of generality ∀qi , qj ∈ Q, with qi , qj : Pr(x 00 xij ) Pr(x 0 xij ) > where [x 0 ] = qi , [x 00 ] = qj ∃xij ∈ Σ? , 0 Pr(x ) Pr(x 00 ) It follows from induction that if we start with a distribution ℘ on Q such that ℘i = ℘j = 0.5, then for any  0 > 0 we can construct a finite ij ij string xij 0 such that if δ(qi , x0 ) = qr , δ(qj , x0 ) = qs , then for the new ij 0 distribution ℘ after execution of x0 will satisfy ℘s0 > 1 − 0 . Recalling that PH is strongly connected, we note that, for any qt ∈ Q, there exists a string y ∈ Σ? , such that δ(qs , y) = qt . Setting xi?,j→t = xij 0 y, we can ensure that the distribution ℘ 00 obtained after execution of xij ? satisfies ℘t00 > 1 −  0 for any qt of our choice. For arbitrary initial A distributions ℘ on Q, we must consider contributions arising from simultaneously executing x?i,j→t from states other than just qi and qj . Nevertheless, it is easy to see that executing x?i,j→t implies that 0 0 A A 0 in the new distribution ℘A , we have ℘A t > ℘i + ℘j −  . It follows 1,2→|Q| 3,4→|Q| n−1,n→|Q| that executing the stringx x ···x , where |Q| if |Q| is even n= (37) |Q| − 1 otherwise

Since :

A 00

00 ℘A |Q|

would result in a final distribution ℘ which satisfies > 1− Appropriate scaling of  0 then completes the proof. Theorem 1 induces the notion of -synchronizing strings, and guarantees their existence for arbitrary PFSA. 1 0 2 n .

Definition 7 (-synchronizing Strings). A string x ∈ Σ? is synchronizing for a PFSA if: ∃ϑ ∈ E, ||℘x − ϑ||∞ 5  (38) Theorem 1 is an existential result, and does not yield an algorithm for computing synchronizing strings (See Theorem 3). We may estimate an asymptotic upper bound on such a search. Corollary 1 (To Theorem 1). At most O(1/) strings from the lexicographically ordered set of all strings over the given alphabet need to be analyzed to find an -synchronizing string. e Proof: Theorem 1 works by multiplying entries from the Π matrix, which cannot be all identical (otherwise the states would collapse). Let the minimum difference between two unequal entries be η. Then, following the construction in Theorem 1, the length ` of the synchronizing string, up to linear scaling, satisfies: η` = O(), implying ` = O(log(1/). Hence, the number of strings to be analyzed is at most all strings of length `, where |Σ|` = |Σ|O(log(1/) = O(1/).

B. Symbolic Derivatives Computation of -synchronizing strings requires the notion of symbolic derivatives. Note that, PFSA states are not observable; we observe symbols generated from hidden states. A symbolic derivative at a given string specifies the distribution of the next symbol over the alphabet. Notation 4. We denote the set of probability distributions over a finite set of cardinality k as D (k). Definition 8 (Symbolic Count Function). For a string s over Σ, the count function #s : Σ? → N ∪ {0}, counts the number of times a particular substring occurs in s. The count is overlapping, i.e., in a string s = 0001, we count the number of occurrences of 00s as 0001 and 0001, implying #s 00 = 2. Definition 9 (Symbolic Derivative). For a string s generated by a QSP over Σ, the symbolic derivative φs : Σ? → D (|Σ| − 1) is defined: #s xσi φs (x) i = P (39) s σi ∈Σ # xσi ? s s Thus, ∀x ∈ Σ , φ (x) is a probability distribution over Σ. φ (x) is referred to as the symbolic derivative at x. e induces a probability distribution over Σ as Note that ∀qi ∈ Q, π e(qi , σ|Σ| )]. We denote this as π e(qi , ·). [e π(qi , σ1 ), · · · , π We next show that the symbolic derivative at x can be used to estimate this distribution for qi = [x], provided x is -synchronizing.

Theorem 2 (-Convergence). If x ∈ Σ? is -synchronizing, then: e([x], ·)||∞ 5a.s  ∀ > 0, lim ||φs (x) − π (40) |s|→∞

Proof: We use the Glivenko-Cantelli theorem [18] on uniform convergence of empirical distributions. Since x is -synchronizing: ∀ > 0, ∃ϑ ∈ E, ||℘x − ϑ||∞ 5  (41)  Recall that E = ei ∈ [0 1]|Q| , i = 1, · · · , |Q| denotes the set of distributions over Q satisfying:  1, if i = j i e |j = (42) 0, otherwise Let x -synchronize to q ∈ Q. Thus, when we encounter x while reading s, we are guaranteed to be distributed over Q as ℘x , where: ||℘x − ϑ||∞ 5  ⇒ ℘x = αϑ + (1 − α)u (43) where α ∈ [0, 1], α = 1 − , and u is P an unknown distribution over |Q| e(qj , ·), we note that Q. Defining Aα = αe π(q, ·) + (1 − α) j=1 uj π φs (x) is an empirical distribution for Aα , implying: e(q, ·)||∞ = lim ||φs (x) − Aα + Aα − π e(q, ·)||∞ lim ||φs (x) − π |s|→∞

|s|→∞

a.s. 0 by Glivenko-Cantelli

z

}|

{

e(q, ·)||∞ 5 lim ||φs (x) − Aα ||∞ + lim ||Aα − π |s|→∞

|s|→∞

5a.s (1 − α) (||e π(q, ·) − u||∞ ) 5a.s  This completes the proof.

C. Computation of -synchronizing Strings Next we describe identification of -synchronizing strings given a sufficiently long observed string (i.e. a sample path) s. Theorem 1 guarantees existence, and Corollary 1 establishes that O(1/) substrings need to be analyzed till we encounter an -synchronizing string. These do not provide an executable algorithm, which arises from an inspection of the geometric structure of the set of probability vectors over Σ, obtained by constructing φs (x) for different choices of the candidate string x. Definition 10 (Derivative Heap). Given a string s generated by a ? QSP, a derivative heap Ds : 2Σ → D (|Σ|− 1) is the set of probability distributions over Σ calculated L ⊂ Σ? as:  sfor a subset of ?strings s D (L) = φ (x) : x ∈ L ⊂ Σ (44) Lemma 6 (Limiting Geometry). Let us define: D∞ = lim lim? Ds (L) |s|→∞ L→Σ

(45)

If U∞ is the convex hull of D∞ , and u is a vertex of U∞ , then e(q, ·) ∃q ∈ Q, such that u = π (46) Proof: Recalling Theorem 2, the result follows from noting that any element of D∞ is a convex combination of elements from the e(q|Q| , ·)}. set {e π(q1 , ·), · · · , π Lemma 6 does not claim that the number of vertices of the convex hull of D∞ equals the number of states, but that every vertex corresponds to a state. We cannot generate D∞ since we have a finite observed string s, and we can calculate φs (x) for a finite number of x. Instead, we show that choosing a string corresponding to the vertex of the convex hull of the heap, constructed by considering O(1/) strings, gives us an -synchronizing string with high probability. Theorem 3 (Derivative Heap Approx.). For s generated by a QSP, let Ds (L) be computed with L = ΣO(log(1/)) . If for x0 ∈ ΣO(log(1/)) , φs (x0 ) is a vertex of the convex hull of Ds (L), then Prob(x0 is not -synchronizing) 5 e−|s|p0 (47) where p0 is the probability of encountering x0 in s. Proof: The result follows from Sanov’s Theorem [19] for convex set of probability distributions. If |s| → ∞, then x0 is guaranteed to be -synchronizing (Theorem 1, and Corollary 1). Denoting the number of times we encounter x0 in s as n(|s|), and since D∞ is a convex set of distributions (allowing us to drop the polynomial factor in Sanov’s bound), we apply Sanov’s Theorem to the case of finite s:   e >  5 e−n(|s|) Prob KL φs (x0 ) ℘x0 Π (48)

5

where KL(·||·) is the Kullback-Leibler divergence [20]. From the bound [21]:  1 s e e 2 5 KL φs (x0 ) ℘x Π (49) ||φ (x0 ) − ℘x0 Π|| ∞ 0 4 and n(|s|) → |s|p0 , where p0 > 0 is the stationary probability of encountering x0 in s, we conclude:  e ∞ >  5 2e− 21 |s|p0 Prob ||φs (x0 ) − ℘x0 Π|| (50) which completes the proof. III. Entropy Rate for PFSA-generated Processes Given the PFSA model, the entropy rate is easily computable. Theorem 4 (Entropy Rate For PFSA). The entropy rate H(G), in e) is given by: bits, for the QSP generated by a PFSA G = (Q, Σ, δ, π H(G) =

|Q| X

X e(qi , σj ) log π e(qi , σj ) ℘λ i π

(51)

σj ∈Σ

i=1

where the base of the logarithms is 2. Proof: Denote the QSP generated by G as X = {Xi }. Using the chain rule (See Eq. (3)), we have: n 1X H(Xi |Xi−1 , · · · , X1 ) (52) H(G) = lim n

n→∞

i=1

Since G is always at somestate q ∈ Q, we conclude that forany i: H(Xi |Xi−1 , · · · , X1 ) ∈

X 



e(q, σj ) log π e(q, σj ) : q ∈ Q π

σj ∈Σ



Furthermore, since G is strongly connected, and therefore has a unique stationary distribution ℘λ [22], the number of times state qi occurs approaches n℘λ i as n → ∞. This completes the proof. If the underlying PFSA model is not available, and we have only a symbolic stream generated by a QSP, then Eq. (51) cannot be directly employed to estimate the entropy rate. In that case, one possibility is to first infer the hidden PFSA using the algorithm reported in [12], and then estimate the entropy rate from Eq. (51). However, if we are only interested in the latter, then we do not need to infer the complete generative model; and there exists a more parsimonious approach to estimate the entropy rate directly. First, we need a lemma which bounds the deviation in entropy for deviations in the probability distribution in the discrete case. Lemma 7 (Bound on Entropy Deviation). For probability distributions p, q on a finite set Σ, we have for all  ∈ (0, 1), ||p − q||∞ 5  ⇒ |Σ| − 1 1 |H(p) − H(q)| <  0 log + (1 −  0 ) log 0  1 − 0  if  5 1 / 2 where  0 = 1 −  otherwise where H(p), H(q) are entropies for distributions p, q respectively. Proof: We have from definition: H(p) − H(q) =

X i

pi log

1 pi



X i

qi log

1 qi

We note that the function f(x) = x log x1 satisfies:   1 1 δf = log − δx (53) x ln 2 implying that perturbations of x cause maximum change in f, when x is in the neighborhood of 0, which in turn implies that deviation in entropy for a perturbed distribution p is the maximized when: p → p? = 0 · · · 0 1 upto permutations (54) Since, ||p−q||∞ 5 , the perturbed distribution q from p = p? is nonunique. We claim (Claim A), that the perturbed distribution resulting in maximum deviation, is given by:  entropy 0 0 ? 1 −  0 upto permutations q = |Σ−1| · · · |Σ− (55) 1|

 if  5 1/2 (56) 1 −  otherwise ? To establish this claim, weX first note that q satisfies the constraints: ∀i q?i > 0, q?i = 1, ||p? − q? ||∞ =  (57)

where  0 =

i

Let q 0 be a perturbation of q? , definedX as: ai = 0 qi0 = q?i + ai , with

(58)

i

satisfying the constraint: ||q 0 − p? ||∞ 5  (59) Note that the above constraint, and the definition of q? implies that: a|Σ| = 0 (60) Then we claim that for small perturbations, H(q 0 ) < H(q? ) (61) We find differential perturbations in contribution to the entropy from perturbation of each entry in q? . For terms i ∈ {1, · · · , |Σ| − 1}, we note that the perturbed term is of the form: 0 + x |Σ| − 1 g(x) = log 0 (62) |Σ| − 1  +x   1 |Σ| − 1 ⇒ δg(0) = log − ln 2 ai (63) |Σ| − 1 0 if ai is small. And the |Σ|-th term is of the form: 1 f(x) = (1 −  0 + x) log (64) 1 −   + x 1 ⇒ δf(0) = log − ln 2 a|Σ| (65) 1 − 0 if a|Σ| is small. This implies that the perturbation of entropy, for small perturbations in the distribution q? , is given by:  |Σ|− X1 |Σ| − 1 − ln 2 ai |Σ| − 1 0   i=1 1 + log − ln 2 a|Σ| (66) 1 − 0 P|Σ|−1 Noting that i=1 ai = −a|Σ| , and setting b = |Σ| − 1, we have:    ! b 1 1 log 0 − ln 2 + log δH(q? ) = − − ln 2 a|Σ| b  1 − 0 !   1 1 b 1 − 1 ln 2 + log − log a|Σ| (67) = b 1 −  0{z b  }0 | {z } | δH(q? ) =

1



log

t2

t1

We note that since |Σ| = 2, t1 5 0. Then, since  0 5 1/2, we have: 1 log 5 1 with equality for  0 = 1/2 (68) 1 − 0 b 1 And we note that b log  0 attains its minimum value of 1/b + 1/b log b at  0 = 1/2, implying:   1 δH(q? ) 5 − 1 (ln 2 − 1) 5 0 (69) b

This establishes that within the set of admissible perturbed distributions q 0 , from q? , all infinitesimally small perturbations necessarily reduce the entropy, i.e., H(q? ) attains a locally maximum value. We note that for all arbitrary admissible perturbations q 0 from q? , ||q 0 − p? ||∞ 5  and definition of  0 implies that each entry in q 0 is either always in [0, 1/2], or in [1/2, 1], and not both. Noting that each summand in the calculation of entropy is of the form x log x, which is monotonic in both intervals, we conclude that H(q? ) is indeed the globally maximum entropy within all admissible perturbations q 0 . It follows that any perturbation of q? , satisfying the constraint of Eq. (59), leads to a smaller difference of entropy from p? , which establishes claim A. Noting that: |Σ| − 1 1 |H(p? ) − H(q? )| =  0 log + (1 −  0 ) log 0 1 − 0 completes the proof. This bound on entropy deviation for ∞-norm bounded deviations in distribution will be important in the sequel. We denote this as the generalized binary entropy function B(, |Σ|). Definition 11 (Generalized Binary Entropy Function). |Σ| − 1 1 B(, |Σ|) =  0 log + (1 −  0 ) log 0  1 − 0  if  5 1 / 2 where  0 = 1 −  otherwise

(70)

Corollary 2 (To Lemma 7). Given a symbol stream generated by a

6

e), and an -synchronizing string PFSA G = (Q, Σ, δ, π x0 , we have: X lim 1 lim H(φs (x0 x)) − H(G) < B(, |Σ|) n→∞ |Σn | |s|→∞ + x∈Σn +

Proof: We first establish the following claim (Claim A): x0 is -synchronizing implies that any right extension x0 x is also synchronizing (where x ∈ Σ? ). To see this, note that x0 is synchronizing implies ∃ϑ ∈ E with: ℘x0 = αϑ + (1 − α)u, with α ∈ [0, 1], α = 1 −  (71) where u is an unknown distribution over Q. It follows that: ∀σ ∈ Σ, 1 (αϑΓσ + (1 − α)uΓσ ) ℘x0 σ = (72) ||℘x0 Γσ ||1

Now, ∀σ ∈ Σ, there is an unique ϑ ∈ E, such that ϑ = ϑΓσ , and since ||℘x0 Γσ ||1 5 1, it follows that: ∀σ ∈ Σ, ∃ϑ 0 ∈ E, such that: ℘x0 σ = α 0 ϑ 0 + additional terms , with α 0 ∈ [0, 1], α 0 = 1 −  By straightforward induction, we conclude that: ∀x ∈ Σ? , ∃ϑ(x) ∈ E, such that ||℘x0 x − ϑ(x)||∞ 5  (73) which establishes Claim A. Next we claim (Claim B) that H(G) can be written as: 1 X H(G) = lim n H (e π([x], ·)) (74) n→∞

0

|Σ+ |

0

x∈Σn +

H(G) =

℘λ i H(e π(qi , ·))

(75)

i=1

Set the initial state of G to be q ∈ Q, where x0 -synchronizes to q. For any n, and each x ∈ Σn + , [x] is the equivalence class corresponding to some qi ∈ Q. Let the number of times [x] corresponds to qi , for x ∈ Σn + , be ni . Then, uniqueness of ℘λ implies that limn→∞ ni /n = ℘λ |i , which implies: |Q| X

℘λ i H(e π(qi , ·)) = lim

n→∞

i=1

X

1 |Σn +|

π([x], ·)) H (e

(76)

x∈Σn +

establishing Claim B. Thus, we can write:



X 1 = lim n lim H(φs (x0 x)) − H(e π([x], ·)) n→∞ |Σ+ | |s|→∞ x∈Σn + 1 X s 5 lim n lim H(φ (x x)) − H(e π ([x] , ·)) 0 |s|→∞ n→∞ |Σ |

(77)

    |s|ζ(1 − r) , |s|ζ(1 + r)

Using Chernoff bounds for the probability of n ∈ Vr , we have:   s0 >  φ (x x) Pr φs (x0 x) − lim 0 |s 0 |→∞ ∞ X  −22 n 0 Pr {Nx0 x = n 0 } 5 2e n 0 ∈Ur

+

n→∞

1

X

|Σn + | x∈Σn +

(78) (79)

B(, |Σ|)

(81)

which completes the proof. Next we modify the Dvoretzky-Kiefer-Wolfowitz inequality, to be applicable to the case where the number of samples drawn is itself a random variable. Lemma 8 (DKW-bound for symbolic derivatives). For a string s generated by a PFSA, and a given -synchronizing string x0 : ∀x ∈ Σ? such that x0 x occurs in s with probability ζ > 0 ,   2 s0 >  < 8(1 + 1 )e−|s|ζ 1+2 Pr φs (x0 x) − lim φ (x x) 0 0 |s |→∞

e



Proof: We note that φs (x0 x) is an empirical distribution with 0 the limiting distribution given by lim|s 0 |→∞ φs (x0 x) , φ? . Using the DKW inequality [23], and denoting the number of occurrences of x0 x in s with the random variable Nx0 x , we have:  s φ (x0 x) − lim φs 0 (x0 x) Pr 0 |s |→∞



X

2e−2

2n0

Pr {Nx0 x = n 0 }



n 0 ∈V

5

r     r2 |s|ζ 2 2|s|ζr × 2e−2 |s|ζ(1−r) × 1 + 1 × 2e− 2+r

 

5 4 |s|ζr e−2 Denoting |s|ζ as t, we have the bound: 

2 |s|ζ(1−r)

+ 2e−

r2 |s|ζ 2+r

  r2 t 2 ∀r > 0, f(r) = 4 rt e−2 t(1−r) + 2e− 2+r

(84) (85)

We note that the two terms are equal if: r2 t

22 t(1 − r) =

+ ln(2drte) (86) 2+r It follows that if we solve for r in terms of  after dropping the non-negative log-term, then the first term would be bigger or equal compared to the second. Solving the resulting quadratic, we get: 2

(87) 1 + 2 A larger value of r makes the first term larger, and the second term 2 smaller; hence we use r = 1+ 2 , leading to the non-tight bound: r
 {Nx0 x = n 0 }

B(, |Σ|) Pr H(φs (x0 x)) − H lim φ (x x) 0 0 |s |→∞

 0, where: " #



X 1 s lim lim H(φ (x x)) − H(G) 0 n→∞ |Σn | + x∈Σn |s|→∞ +

n→∞

(82)

n 0 ∈N



To see this, note that the PFSA G is strongly connected with a unique stationary distribution ℘λ , and Theorem 4 implies: |Q| X

 2 0 5 2e−2 n Pr {Nx0 x = n 0 }   0 φs (x0 x) >  ⇒ Pr φs (x0 x) − lim |s 0 |→∞ ∞ X  2 0 5 2e−2 n Pr {Nx0 x = n 0 }

1 e

 e

2 −|s|ζ  2 1+

Proof: It follows from Lemma 7 and continuity of entropy that s φ (x0 x) − lim φs 0 (x0 x) 5  ∞ |s 0 |→∞   s0 5 B(, |Σ|) ⇒ H(φs (x0 x)) − H lim φ (x x) 0 0 |s |→∞

Using Lemma 8, we have:   2 s0 5  = 1 − 8(1 + 1 )e−|s|ζ 1+2 Pr φs (x0 x) − lim φ (x x) 0 ∞ 0 e |s |→∞     s s0 ⇒ Pr H(φ (x0 x)) − H lim φ (x x) 5 B ( , |Σ|) 0 |s 0 |→∞   2 1 −|s|ζ 1+ 2 =1−8 1+ e e

which completes the proof. Theorem 5 (Bound on Entropy Calculation with Finite Samples). e), and a given For any string x generated by a PFSA G = (Q, Σ, δ, π -synchronizing string x0 , there exist C0 , C1 depending only on the size of the alphabet |Σ|, such that, for any independently chosen set

7

of strings N j Σ? : 1 X

H(φ (x0 x)) − lim

n→∞

x∈N

! > B(, |Σ|) + 

5 C0

|Σn +|

+ 2e

|s|3

1 2 3

C1 |N|2

4 5

X

1

H(φ (x0 x)) − lim

/* I. -synchronization String Identification */

lim H(φ (x0 x)) |s|→∞ s

7 8

n→∞ |Σn |N| x∈N + | x∈Σn + X X 1 1 s 5 H(φ (x0 x)) − n lim H(φs (x0 x)) |s|→∞ |N| x∈N |Σ+ | n x∈Σ

+ X 1 X 1 lim H(φs (x0 x)) lim H(φs (x0 x)) − lim n + n→∞ |Σ | |s|→∞ |s|→∞ |N|

13

Select N ⊂ Σ? with length ` strings drawn with probability

14 15 16 17 18 19 20

/* |N| ∼ 107 log2 |Σ| sufficient for negligible uncertainty contribution */ foreach x ∈ N do if #s x0 x > Nmin then Compute u ←− φs (x0 x) /* Symbolic derivative at x0 x */ if ∃ key v ∈ Countmap s.t. ku − vk∞ 5  then Countmap [v] ←− Countmap [v] + 1 else Set Countmap [u] = 1

/* II. Entropy Rate Estimatation */

We denote the two RHS terms as B and C, and note: 1 X 1 X B , H(φs (x0 x)) − n lim H(φs (x0 x)) |Σ+ |

|s|→∞ x∈Σn +

X   |Q| ei H(φs (x0 x 0 )) − lim H(φs (x0 x 0 )) = ℘ |s|→∞

(90)

i=1

where x0 -synchronizes to q0 ∈ Q, δ(q0 , x 0 ) = qi , and ℘e is the empirical estimate of the stationary distribution. Using the bound from Corollary 3: 

Pr(B > B(, |Σ|)) < 8 1 +  ) 5 2e−2|N|

2

− 22 |N|2 log |Σ|

⇒ Pr(ke ℘ − ℘λ k1 log |Σ| 5 ) > 1 − 2e (93) Using the bounds in Eq. (91), and (93), we get: E , Pr (B + C 5 B(, |Σ|) + )     −|N|2 22 (1 + 2 )|Q| log |Σ| > 1 − 8(1 + 1/e) × 1 − 2 e 2

error [bits / letter]

|s|

|s|3

− 2e−C1 |N|

2

finite string s generated by x0 ∈ Σ? satisfying the prehave for any independently

! H(φ (x0 x)) − H(G) >  + 2B(, |Σ|)

5 C0

−C1 |N|2

−p0 |s|

+ 2e +e (95) |s|3 2 2 where C0 = (8/e + 8/e )(|Σ| − 1), C1 = 2/ log |Σ| and p0 is the non-zero occurrence probability of x0 in s.

95% confidence 99% confidence

107

108

109

1010

input length [# of symbols]

0:8

1011

1012

95% confidence

0:6

99% confidence

0:4

0:22

0:2 0

which completes the proof.

1 + 2

2

+ 2e−C1 |N|0 + e−0 p0 |s| 5 1

0:9

0:5

2 log2 |Σ|

1+

1+2 0

|s|3

(b) Alphabet Size = 2

e|s|

Theorem 6 (Main Theorem). Given a e), and a string a PFSA G = (Q, Σ, δ, π conditions described in Theorem 3, we chosen set of strings N j Σ? : 1 X s

1

0 106

Since we are using -synchronization, it follows that the number of states |Q| is upper then yields:  bounded 2by  (|Σ| − 1)/ which  1+  −C1 |N|2 E > 1 − C0 1 − 2e (94) 3

⇒Pr (A 5 B(, |Σ|) + ) > 1 − C0

*/

5  106

where x0 -synchronizes to q0 ∈ Q and δ(q0 , x ) = qi . Using DKW:

2

H(v): entropy of v

(a) Alphabet Size = 27

1:5

0

with C0 = (8/e + 8/e2 )(|Σ| − 1), and C1 =

/*

0

E ←− ? + 2B(? , |Σ|) return h, E

i=1

x∈N

Delete x from N

foreach key v ∈Countmap do  Countmap [v] × H(v) h ←− h + Counttotal

|s|→∞

X  |Q| ei − ℘λ i lim H(φs (x0 x 0 )) 5 ke ℘ − ℘λ k1 log |Σ| ℘ = |s|→∞

Pr |N|

Counttotal ←− Counttotal + 1

/* III. Uncertainty Estimation */ 27

2

|Σ+ |

1 |Σ|`

26

(1 +  )|Q| e|s|2

n→∞

21 22 23

i=1

For the second RHS term: 1 X 1 X lim H(φs (x0 x)) C, lim H(φs (x0 x)) − lim n |s|→∞

D[x] ←− φs (x)

A ←− {x 0 : D[x 0 ] is on the convex hull of the set of values in hashtable D} x0 ←− argmaxx∈A #s x /* -synchronization string */ p0 ←− (#s x0 )|s| /* Occurrence prob. of -synchronization string */

+

|N| x∈N

(1/) foreach x ∈ Σlog do +

9 10 11 12

+x∈Σn

x∈N

|N| x∈N

Input: Data sequence s over alphabet Σ, , Confidence level α Output: Entropy rate h, Uncertainty E at specified confidence level Initialize h = 0 Initialize Counttotal = 0 Initialize Countmap = ∅ /* hashtable with keys as probability distributions, and values as doubles */ Set C0 = (8/e + 8/e2 )(|Σ| − 1), C1 = 2/ log2 |Σ| Set Nmin = 10 /* Any small integer suffices (See Section IV) */

6

Proof: We note that: 1 X s

A=

x∈Σn +

1 + 2

lim H(φ (x0 x)) |s|→∞ s

error [bits / letter]

Pr |N|

Algorithm 1: Detailed pseudocode for entropy rate estimation

X

1

s

105

106

107

108

109

input length [# of symbols]

1010

Fig. 3. Uncertainty bounds for different alphabet sizes. Note for a data length of

5 × 106 , we have an uncertainty of 0.9 bits at 95% confidence for a 27 letter alphabet (plate (a)); the corresponding uncertainty for a binary alphabet is 0.22 bits (plate(b)).

Proof: It follows from Corollary 7, and Theorem 5, that: ! 1 X T , Pr H(φs (x0 x)) − H(G) 5  + 2B(, |Σ|) |N| x∈N

 >

1 − C0

1 + 2 |s|3

− 2e−C1 |N|)

2

 × Pr(x0 is -synchronizing)

Assuming x0 satisfies the pre-conditions described in Theorem 3:    1 + 2 −C1 |N|2 T > 1 − C0 − 2 e × 1 − ep0 |s| 3 |s|

8

English Text

this-paper

Shannon Estimate

0 2 input length [# of symbols

106 ]

entropy rate [bits / letter]

50 000

LZ-compression

0:8

this-paper

0:7

Theoretical value

0:6

0 j0:85

0:5

100 000

0

input length [# of symbols]

this-paper

2

Shannon Estimate

0

Theoretical value

0

state-of-art

1

0:3

4

(b) Shakespeare Collected Works 3

this-paper

0:35

(e) Non-synchronizable Stochastic process

2

4 input length [# of symbols 106 ]

3 2

(d) Map xn+1 = 1

rx2n , with r = 1:75 LZ-compression this-paper

1 0

Theoretical value

0

50 000 input length [# of symbols ]

q0

10 000

1 j0:15 1 j0:75

q1

0 j0:25

20 000

30 000

input length [# of symbols]

100 000

(f) Synchronizable Stoachastic Process entropy rate [bits / letter]

0

LZ-compression

0:01]

1

rx2n , with r = 1:7499 entropy rate [bits / letter]

entropy rate [bits / letter]

state-of-art

entropy rate [bits / letter

entropy rate [bits / letter]

(a) King James’ Bible 2

Finite Memory Stochastic Processes

Chaotic Dynamical Systems (c) Map xn+1 = 1

0:75

LZ-compression this-paper

0:7

Theoretical value

0:65

0 j0:85

0:6 0

10 000

q0

1 j0:15 0 j0:25

q1

20 000

1 j0:75 30 000

input length [# of symbols ]

Applications. Plates (a-b): Entropy rate of English text. Shannon’s experiment using human subjects puts the estimate around 1 bit per letter. We achieve very close estimates. The state-of-the-art plots are replicated from [8]. Plates (c-d): Entropy rate of sequences generated by a chaotic dynamical system and a binary generating partition. Plates (e-f): Entropy of symbol streams generated by probabilistic automata. Note that even with two states, and a binary alphabet, a non-synchronizable generating process leads to significantly larger errors with the LZ-based approaches.

Fig. 4.

> 1 − C0

1 + 2 |s|3

2

− 2e−C1 |N| − ep0 |s|

(96)

which completes the proof. Remark 1. We note the following: • For binary alphabets, we have: C0 ' 4.03, C1 = 2. • Each term on the RHS of Eq. (95) reflects a specific contribution: Dependence on Summation Depth

z }| { 2 0 |s| C0 + 2e−C1 |N| + e| −p {z } |s|3 Synchronization Error Dependence | {z }

1 + 2

(97)

Data-length Dependence

Eq. (95) bounds the maximum uncertainty at a given confidence level, which depends on the alphabet size. The uncertainty relationships for two alphabet sizes (2, 27) are shown in Figure 3. Corollary 4 (To Theorem 6). As a function of the length of the observed data string s, the upper and lower confidence bands for the estimatedentropyrate, with any fixed confidence level, converge at log |s| . a rate O 1/3 |s|

Proof: For a given confidence level k = kdata + kdepth , where kdata captures the dependence on the data length through the first RHS term in Eq. (95), we get:  1/3 C0 (1 + 2 ) 2C0 3 = ⇒< (98) |s|kdata

|s|kdata

The distance between the confidence bands is given by: B = 2 + 4B(, |Σ|) (99) and using Eq. (98), along with the definition of the generalized binary entropy function (Definition 11), completes the proof. IV. Algorithmic Implementation The algorithmic steps for the proposed entropy rate estimation technique is enumerated in Algorithm 1. The inputs to the algorithm is the data stream s, , and the confidence level α at which the error estimate is desired. Importantly, the size of the set of sampled string N is not required to be an input; if computational effort is not a concern, then the uncertainty contribution from the term involving |N| (See Eq. (95)) can be reduced to negligible levels by using a sample set with |N| ' 2K2 log2 |Σ|, which would result in uncertainty contribution

of ∼ e−K . Using |N| ' 107 log2 |Σ| is generally sufficient to make this factor negligible; smaller sets may be used under computational constraints, which would lead to increased uncertainty in the entropy estimate. Particularly rare strings may accumulate errors, which is prevented in the implementation by ignoring strings that occur too infrequently (Note Nmin in step 5 and step 15 of Algorithm 1). A. Application to English text, Chaotic systems & Random walks We demonstrate Algorithm 1 in three different applications. Our first application is the estimation of the entropy rate of English text. Shannon’s experimental approach with human subjects [24] suggests that English has an entropy of around one bit per letter. However, the large alphabet size (26 letters + space = 27), makes it computationally hard to verify this value. We apply our algorithm to relatively small corpora: the King James Bible (KJB) (which has a length ∼ 4 × 106 letters), and the collected works of Shakespeare (SHK, length ∼ 4.8 × 106 letters). These particular examples allow direct comparison against the results reported in [8]. We obtain entropy rates which are significantly closer to the Shannon estimate (See Figure 4): 1.05 bits/letter for KJB, and 1.25 bits/letter for SHK, while Schürmann et al. obtain the corresponding estimates to be 1.73 bits/letter and 2.13 bits/letter. The authors in [8] were able to improve the SHK estimate to 1.7 bits/letter using the “ansatz” mentioned before; Algorithm 1 yields an improved estimate without any such assumptions. Our second application is entropy estimation of sequences produced by chaotic dynamical systems. We use the same iteration map used in [8]: namely xn+1 = 1 − rx2n , and use a binary generating partition at x = 0. We analyze the cases r = 1.7499 (Figure 4(b)) where it is very strongly intermittent, and r = 1.75 which is the Pomeau-Manneville intermittency point (Figure 4(c)). As before, we converge faster in the non-trivial case, and gets very close to the theoretical entropy given by the positive Lyapunov exponent due to Pesin’s identity [7]. Our third application analyzes sequences generated by finite memory ergodic stationary stochastic processes, modeled directly via probabilistic automata (Figure 4(e-f)). Thus, we are looking at generalized random walks, Inspite of being somewhat more contrived compared to the first two applications, we can gain important insights

9

from this example. Even with two states, and with a binary alphabet, LZ-based approaches may perform significantly worse, particularly for short streams with long range dependencies. We note that the PFSA generator used in Figure 4(e) is non-synchronizable, i.e., no finite length of observed history tells us definitively what the current state is. Nevertheless, as we showed in Theorem 1, the machine is -synchronizable; and Algorithm 1 performs quite well, converging to the theoretical value with just under 104 symbols. In contrast, the LZ-compression based algorithm has an error of about 17% even after 3 × 104 symbols. This is discrepancy in performance disappears if the generating process is synchronizable, e.g., if a finite history tells us precisely what the current state is. Indeed with a synchronizable PFSA in Figure 4(f) (here, the last symbol is sufficient to fix the current state), the algorithms have comparable performances.

References

[5] J. Langdon, G.G., “A note on the ziv - lempel model for compressing individual sequences (corresp.),” Information Theory, IEEE Transactions on, vol. 29, no. 2, pp. 284–287, 1983. [6] P. Grassberger, “Estimating the information content of symbol sequences and efficient codes,” IEEE Trans. on Inf. Theory, vol. 35, no. 3, pp. 669– 675, 1989. [7] J. Rissanen, “A universal data compression system,” IEEE Trans. on Inf. Theory,, vol. 29, no. 5, pp. 656–664, 1983. [8] T. Schürmann and P. Grassberger, “Entropy estimation of symbol sequences,” CHAOS, vol. 6, no. 3, pp. 414–427, 1996. [9] A. Paz, Introduction to probabilistic automata (Computer science and applied mathematics). Orlando, FL, USA: Academic Press, Inc., 1971. [10] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991. [11] E. Vidal, F. Thollard, C. de la Higuera, F. Casacuberta, and R. Carrasco, “Probabilistic finite-state machines - part i,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 7, pp. 1013–1025, July 2005. [12] I. Chattopadhyay and H. Lipson, “Abductive learning of quantized stochastic processes with probabilistic finite automata,” Philos Trans A, vol. 371, no. 1984, p. 20110543, Feb 2013. [13] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd ed. Addison-Wesley, 2001. [14] I. Chattopadhyay and A. Ray, “Structural transformations of probabilistic finite state machines,” International Journal of Control, vol. 81, no. 5, pp. 820–835, May 2008. [15] R. Gavaldà, P. W. Keller, J. Pineau, and D. Precup, “Pac-learning of markov models with hidden state,” in ECML, ser. Lecture Notes in Computer Science, J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, Eds., vol. 4212. Springer, 2006, pp. 150–161. [16] S. Bogdanovic, B. Imreh, M. Ciric, and T. Petkovic, “Directable automata and their generalizations - a survey,” Novi Sad Journal of Mathematics, vol. 29, no. 2, pp. 31–74, 1999. [17] M. Ito and J. Duske, “On cofinal and definite automata,” Acta Cybern., vol. 6, pp. 181–189, 1984. [18] F. Topse, “On the glivenko-cantelli theorem,” Probability Theory and Related Fields, vol. 14, pp. 239–250. [19] I. Csiszár, “Sanov property, generalized I-projection and a conditional limit theorem,” Ann. Probab., vol. 12, pp. 768–793, 1984.

[1] “A note on kolmogorov complexity and entropy,” Applied Mathematics Letters, vol. 16, no. 7, pp. 1129 – 1130, 2003. [2] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Tran on Information Theory, vol. 23, no. 3, pp. 337–343, 1977. [3] ——, “Compression of individual sequences via variable-rate coding,” IEEE Tran on Inf. Theory, vol. 24, no. 5, pp. 530–536, 1978. [4] A. D. Wyner and J. Ziv, “The sliding-window Lempel-Ziv algorithm is asymptotically optimal,” Proceedings of the IEEE, vol. 82, no. 6, pp. 872–877, 1994.

[20] E. L. Lehmann and J. P. Romano, Testing statistical hypotheses, 3rd ed., ser. Springer Texts in Statistics. New York: Springer, 2005. [21] A. Tsybakov, Introduction to nonparametric estimation, ser. Springer series in statistics. [22] W. Stewart, Numerical methods for computing stationary distribution of finite irreducible Markov chains. New York: Springer, 1999. [23] P. Massart, “The tight constant in the dkw inequality,” The Annals of Probability, vol. 18, no. 3, pp. pp. 1269–1283, 1990. [24] C. E. Shannon, “Prediction and entropy of printed english,” Bell System Technical Journal, vol. 30, pp. 50–64, Jan. 1951.

V. Summary & Conclusion We delineate a new algorithm for estimating entropy rates of symbol streams, generated by hidden ergodic stationary processes. We establish the correctness of the algorithm by exploiting a connection with the theory of probabilistic automata, and that of finite measures on infinite strings. Importantly, we establish a distribution-free limit theorem. Using established results from non-parametric statistics, p we show that entropy estimate converges at the rate O(log |s|/ 3 |s|) as a function of the input data length |s|. In consequence, we are able to derive confidence bounds on the estimate, and dictate the worstcase data length required to guarantee a specified error bound at a given confidence level. Finally, we demonstrate that, in terms of data requirements, the proposed algorithm has superior performance to competing approaches, at least in the case of the chosen applications.

Suggest Documents