A Computable Measure of Algorithmic Probability by Finite Approximations

Fernando Soler-Toscano (1,3) and Hector Zenil (2,3)

(1) Grupo de Lógica, Lenguaje e Información, Univ. Sevilla, Spain
(2) Department of Computer Science, University of Oxford, U.K.
(3) Algorithmic Nature Group, LABORES, Paris, France
Abstract. We study formal properties of a Levin-inspired measure m calculated from the output distribution of small Turing machines. We introduce and justify finite approximations $m_k$ that have already been used in applications as an alternative to lossless compression algorithms for approximating algorithmic (Kolmogorov-Chaitin) complexity. We provide proofs of the relevant properties of both m and $m_k$, and compare them to Levin's Universal Distribution. Finally, we provide error estimations of $m_k$ with respect to m.
1 Algorithmic information measures

Central to Algorithmic Information Theory is the definition of algorithmic (Kolmogorov-Chaitin or program-size) complexity [11, 2]:

$$K_T(s) = \min\{|p| : T(p) = s\} \qquad (1)$$
where p is a program that outputs s when running on a universal Turing machine T, and |p| is the length of p in bits. The measure was first conceived to define randomness and is today the accepted objective mathematical measure of randomness, among other reasons because it has been proven to be mathematically robust [12]. In the following we use K(s) instead of $K_T(s)$, because the choice of T is only relevant up to an additive constant (Invariance Theorem). A technical inconvenience of K, as the function taking s to the length of the shortest program that produces s, is its uncomputability: there is no program which takes a string s as input and produces the integer K(s) as output. This is usually considered a major problem, but one ought to expect a universal measure of randomness to have such a property.

In previous papers [6, 16] we have introduced a novel method to approximate K, based on the seminal concept of algorithmic probability introduced by Solomonoff [18] and further formalized by Levin [12], who proposed the concept of uncomputable semi-measures and the so-called Universal Distribution. Levin's semi-measure¹ $\mathfrak{m}_T$ defines the so-called Universal Distribution [10], the value $\mathfrak{m}_T(s)$ being the probability that a random program halts and produces s when running on a universal Turing machine T. The choice of T is only relevant up to a multiplicative constant, so we simply write $\mathfrak{m}$ instead of $\mathfrak{m}_T$. It is possible to use $\mathfrak{m}(s)$ to approximate K(s) by means of the following theorem:

Theorem 1 (Algorithmic Coding Theorem [12]). There is a constant c such that

$$|-\log_2 \mathfrak{m}(s) - K(s)| < c \qquad (2)$$

That is, if a string has many long descriptions, it also has a short one [3]. This theorem beautifully connects frequency to complexity: the frequency (or probability) of occurrence of a string with its algorithmic (Kolmogorov) complexity. It implies [6] that one can calculate the Kolmogorov complexity of a string from its frequency [5, 4, 19, 6], simply by rewriting (2) as

$$K(s) = -\log_2 \mathfrak{m}(s) + O(1) \qquad (3)$$

¹ It is called a semi-measure because, unlike a probability measure, the sum over all strings is never 1. This is due to the Turing machines that never halt.
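As a toy illustration (ours, not code from the paper; the probability 0.25 below is a made-up placeholder), converting an estimated output probability into a complexity estimate via (3) is a one-liner:

    from math import log2

    def ctm_complexity(prob):
        """Estimate K(s) in bits from the algorithmic probability of s, per (3)."""
        return -log2(prob)

    # A string produced with probability 1/4 gets ~2 bits of complexity,
    # up to the additive constant of the Coding theorem.
    print(ctm_complexity(0.25))  # 2.0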
Thanks to this elegant connection established by (2) between algorithmic complexity and probability, our method can attempt to approximate an algorithmic probability measure by means of finite approximations using a fixed model of computation. The method is called the Coding Theorem Method (CTM) [16]. In this paper we introduce m, a computable measure inspired by $\mathfrak{m}$ that can be used to approximate K by means of the algorithmic Coding theorem. Computing m(s) requires the output of a countably infinite number of Turing machines, so we first undertake the investigation of finite approximations $m_k(s)$ that require only the output of machines with up to k states. A key property of $\mathfrak{m}$ and K is their universality: the choice of the Turing machine used to compute the distribution is only relevant up to a (multiplicative or additive) constant, independent of the objects. Our measure m, being computable, necessarily lacks this universality. The same is true when lossless compression algorithms are used to approximate K: despite not being universal, and in fact estimating the classical Shannon entropy rate (they look for repeated patterns in a fixed-length window), they have many applications. In the same way, finite approximations to m have now found successful applications, such as in financial time series [21, 13], in psychology [15, 7, 8, 9] and, more recently, in graph theory [23], but a full investigation exploring their properties and providing theoretical error estimations had not been undertaken. We start by presenting our Turing machine formalism (Section 2) and then show that it can be used to encode a prefix-free set of programs (Section 3). Then, in Section 4, we define a computable algorithmic probability measure m based on our Turing machine formalism and prove its main properties, both for m and for the finite approximations $m_k$. In Section 5 we
compute $m_5$, compare it with our previous distribution D(5) [16], and estimate the error in $m_5$ as an approximation to m. We finish with some comments in Section 6.
2 The Turing machine formalism
We denote by (n, 2) the class (or space) of all n-state 2-symbol Turing machines (the halting state not being counted among the n states), following the Busy Beaver Turing machine formalism as defined by Rado [14]. Busy Beaver Turing machines are deterministic machines with a single head and a single tape, unbounded in both directions. When the machine enters the halting state the head no longer moves, and the output is taken to comprise only the cells visited by the head prior to halting. Formally,

Definition 2 (Turing machine formalism). We designate as (n, 2) the set of Turing machines with two symbols {0, 1} and n states {1, ..., n}, plus a halting state 0. These machines have 2n entries $(s_1, k_1)$ (for $s_1 \in \{1, \dots, n\}$ and $k_1 \in \{0, 1\}$) in the transition table, each with one instruction that determines their behavior. Such entries are represented by

$$(s_1, k_1) \to (s_2, k_2, d) \qquad (4)$$

where $s_1$ and $k_1$ are respectively the current state and the symbol being read, and $(s_2, k_2, d)$ represents the instruction to be executed: $s_2$ is the new state, $k_2$ the symbol to write, and $d$ the direction. If $s_2$ is the halting state 0, then $d = 0$; otherwise $d$ is 1 (right) or −1 (left).

Proposition 3. Machines in (n, 2) can be enumerated from 0 to $(4n+2)^{2n} - 1$.

Proof. Given the constraints in Definition 2, for each entry of the transition table of a Turing machine in (n, 2) there are $4n + 2$ different instructions $(s_2, k_2, d)$: 2 instructions when $s_2 = 0$ (given that $d = 0$ is fixed and $k_2$ can be either of the two possible symbols), and $4n$ instructions when $s_2 \neq 0$ (2 possible moves, n states and 2 symbols). Then, considering the 2n entries of the transition table,

$$|(n, 2)| = (4n+2)^{2n} \qquad (5)$$
These machines can be enumerated from 0 to |(n, 2)| − 1. Several enumerations are possible; we can, for example, use a lexicographic ordering on the transitions (4). For the current paper, consider that some enumeration has been chosen. We then use $\tau_t^n$ to denote machine number t in (n, 2) following that enumeration.
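The paper fixes no concrete enumeration; the following sketch (ours; the function name and digit ordering are our own choices) realizes one, decoding a machine number t in (n, 2) into a transition table by writing t in base 4n + 2, with one digit per table entry:

    def decode_machine(n, t):
        """Decode machine number t in (n, 2) into {(state, symbol): (new_state, write, move)}."""
        base = 4 * n + 2                   # possible instructions per entry (Proposition 3)
        assert 0 <= t < base ** (2 * n)
        table = {}
        for state in range(1, n + 1):      # states are 1..n; 0 is the halting state
            for symbol in (0, 1):
                t, digit = divmod(t, base)
                if digit < 2:              # 2 halting instructions: write 0 or 1, d = 0
                    table[(state, symbol)] = (0, digit, 0)
                else:                      # 4n others: n states x 2 symbols x 2 directions
                    digit, write = divmod(digit - 2, 2)
                    new_state, move = divmod(digit, 2)
                    table[(state, symbol)] = (new_state + 1, write, 1 if move else -1)
        return table

With 2n base-(4n+2) digits, every t below $(4n+2)^{2n}$ yields a distinct machine, which is exactly the count of Proposition 3.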
3 Turing machines as a prefix-free set of programs
We show in this section that the set of Turing machines following the Busy Beaver formalism can be encoded as a prefix-free set of programs capable of generating any finite non-empty binary string.

Definition 4 (Execution of a Turing machine). Let τ ∈ (n, 2) be a Turing machine. We denote by τ(i) the execution of τ over an infinite tape filled with i (the blank symbol), where i ∈ {0, 1}. We write τ(i)↓ if τ(i) halts, and τ(i)↑ otherwise. We write τ(i) = s if, and only if,

• τ(i)↓, and
• s is the output string of τ(i), that is, s is obtained by concatenating (upon halting) the symbols in the tape cells visited by the head of τ during the execution τ(i).

As Definition 4 establishes, we only consider machines running over a blank tape, with no input. To produce a symmetrical set of strings, we allow both symbols 0 and 1 to serve as the blank symbol.

Definition 5 (Program). A program p is a triplet ⟨n, i, t⟩, where

• n ≥ 1 is a natural number,
• i ∈ {0, 1},
• 0 ≤ t < $(4n+2)^{2n}$.

We say that the output of p is s if, and only if, $\tau_t^n(i) = s$.

Programs can be executed by a universal Turing machine that reads a binary encoding of ⟨n, i, t⟩ (Definition 6) and simulates $\tau_t^n(i)$. Trivially, for each finite binary string s with |s| > 0 there is a program p which outputs s. Now that we have a formal definition of programs, we show that the set of valid programs can be represented as a prefix-free set of binary strings.

Definition 6 (Binary encoding of a program). Let p = ⟨n, i, t⟩ be a program (Definition 5). The binary encoding of p is a binary string consisting of the following sequence of bits:

• First, $1^{n-1}0$, that is, n − 1 repetitions of 1 followed by a 0. This encodes n.
• Second, a single bit with value i, encoding the blank symbol.
• Finally, t encoded using $\lceil \log_2((4n+2)^{2n}) \rceil$ bits.
The use of $\lceil \log_2((4n+2)^{2n}) \rceil$ bits to represent t ensures that all programs with the same n are represented by strings of equal length. As there are $(4n+2)^{2n}$ machines in (n, 2), these bits suffice to represent any value of t. The process of reading the binary encoding of a program p = ⟨n, i, t⟩ and simulating $\tau_t^n(i)$ is computable, given the enumeration of Turing machines. As an example, this is the binary representation of the program ⟨2, 0, 185⟩:

    10 0 00000010111001

The first two bits (10) encode n = 2, the next bit (0) encodes the blank symbol i = 0, and the remaining $\lceil \log_2(10^4) \rceil = 14$ bits encode t = 185.
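The encoding is easy to implement; the following sketch (ours) reproduces this example:

    from math import ceil, log2

    def encode_program(n, i, t):
        """Binary encoding of the program <n, i, t> (Definition 6)."""
        assert n >= 1 and i in (0, 1) and 0 <= t < (4 * n + 2) ** (2 * n)
        width = ceil(2 * n * log2(4 * n + 2))   # bits reserved for t, fixed for each n
        return "1" * (n - 1) + "0" + str(i) + format(t, f"0{width}b")

    print(encode_program(2, 0, 185))  # -> 10000000010111001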
The proposed encoding is prefix-free, that is, there is no pair of programs p, p′ such that the binary encoding of p is a proper prefix of the binary encoding of p′. This is because the initial block $1^{n-1}0$ of the binary encoding of p = ⟨n, i, t⟩ determines n, and n in turn determines the total length of the encoding. So p′ cannot be encoded by a binary string of different length that starts with the same initial block.

Proposition 7 (Programming by coin flips). Every source producing an arbitrary number of random bits generates a unique program (provided it generates at least one 0).

Proof. The bits in the sequence are used to produce a unique program following Definition 6, as sketched below. We start by reading n: we consume bits until the first 0 appears. The next bit gives i. Finally, as we now know the value of n, we take the following $\lceil \log_2((4n+2)^{2n}) \rceil$ bits to set the value of t. It is possible that a program constructed in this way has a value of t greater than the maximum $(4n+2)^{2n} - 1$ in the enumeration. In that case we associate the program with some trivial non-halting Turing machine, for example a machine whose initial transition stays in the initial state.

Notice that programming by coin flips, the approach used in the proof of Proposition 7, makes longer programs (and hence Turing machines with more states) exponentially less probable than shorter ones, because of the initial sequence of n − 1 repetitions of 1. This observation is important because when we later use machines in $\bigcup_{n=1}^{k} (n, 2)$ to obtain a finite approximation of our measure, the greater k is, the exponentially smaller the error we allow.
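A minimal decoding sketch (ours; the function names are our own) matching the proof of Proposition 7. It consumes random bits and produces a triplet ⟨n, i, t⟩, returning None in the overflow case that the proof maps to a trivial non-halting machine:

    import random
    from math import ceil, log2

    def coin_flip_program(next_bit):
        """Read a program <n, i, t> from a source of random bits (Proposition 7)."""
        n = 1
        while next_bit() == 1:           # the block 1^(n-1) 0 encodes n
            n += 1                       # (terminates once the source emits a 0)
        i = next_bit()                   # one bit for the blank symbol
        t = 0
        for _ in range(ceil(2 * n * log2(4 * n + 2))):
            t = 2 * t + next_bit()       # same fixed width for t as in Definition 6
        if t >= (4 * n + 2) ** (2 * n):
            return None                  # stands for a trivial non-halting machine
        return (n, i, t)

    print(coin_flip_program(lambda: random.getrandbits(1)))

Because reaching a given n requires n − 1 consecutive 1s, the probability of drawing a program with n states decays as $2^{-n}$, which is precisely the exponential penalty on larger machines noted above.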
4 A Levin-style algorithmic measure
Definition 8. Given a Turing machine A accepting a prefix-free set of programs, the probability distribution of A is defined as

$$P_A(s) = \sum_{p \,:\, A(p) = s} \frac{1}{2^{|p|}} \qquad (6)$$

where A(p) = s if and only if A halts on input p and produces s. The length in bits of program p is represented by |p|.
If A is a universal Turing machine, $P_A$ is a Levin semi-measure, and for any string s, $P_A(s)$ is the algorithmic probability of s calculated on the machine A. Levin's distributions are universal [12], that is, the choice of A (any of the infinitely many universal Turing machines) is only relevant up to a multiplicative constant.

Definition 9 (Distribution m(s)). Let M be a Turing machine executing the programs introduced in Definition 5. Then m(s) is defined by m(s) = $P_M(s)$.

Theorem 10. For any binary string s,

$$m(s) = \sum_{n=1}^{\infty} \frac{|\{\tau \in (n,2) \mid \tau(0) = s\}| + |\{\tau \in (n,2) \mid \tau(1) = s\}|}{2^{\,n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil}} \qquad (7)$$
Proof. By Definition 6, the length of the encoding of program p = ⟨n, i, t⟩ is $n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil$. This justifies the denominator of (7), since (6) weights each program by $2^{-|p|}$. For the numerator, observe that the set of programs producing s and sharing the same value of n corresponds to all machines in (n, 2) producing s with either 0 or 1 as the blank symbol. Note that if a machine produces s both with blank symbol 0 and with blank symbol 1, it is counted twice, as each execution is represented by a different program (the two programs differ only in the bit i).
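To make the weights in (7) concrete (our arithmetic, continuing the example of Section 3): every program with n = 2 has length

$$|p| = n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil = 2 + 1 + \lceil \log_2 10^4 \rceil = 2 + 1 + 14 = 17,$$

so a two-state program that halts with output s, ⟨2, 0, 185⟩ being one of the $2 \cdot 10^4$ candidates, contributes exactly $2^{-17}$ to m(s); a non-halting program contributes nothing.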
4.1 Finite approximations to m
The value of m(s) for any string s depends on the output of an infinite set of Turing machines, so we need ways to approximate it. The method proposed in Definition 11 approximates m(s) by considering only the finitely many Turing machines with up to a certain number of states.

Definition 11 (Finite approximation $m_k(s)$). The finite approximation to m(s) bounded by k states, $m_k(s)$, is defined as

$$m_k(s) = \sum_{n=1}^{k} \frac{|\{\tau \in (n,2) \mid \tau(0) = s\}| + |\{\tau \in (n,2) \mid \tau(1) = s\}|}{2^{\,n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil}} \qquad (8)$$

Proposition 12 (Convergence of $m_k(s)$ to m(s)).

$$\sum_{s \in (0+1)^*} |m(s) - m_k(s)| \le \frac{1}{2^k}$$
Proof. By (7) and (8), and since $m_k(s) \le m(s)$ for every s,

$$\sum_{s \in (0+1)^*} |m(s) - m_k(s)| = \sum_{s \in (0+1)^*} m(s) - \sum_{s \in (0+1)^*} m_k(s) \le \sum_{n=k+1}^{\infty} \frac{2(4n+2)^{2n}}{2^{\,n+1+\lceil \log_2((4n+2)^{2n}) \rceil}}$$

$$\le \sum_{n=k+1}^{\infty} \frac{2(4n+2)^{2n}}{2^n \cdot 2 \cdot 2^{\log_2((4n+2)^{2n})}} = \sum_{n=k+1}^{\infty} \frac{1}{2^n} = \frac{1}{2^k}$$
Proposition 12 ensures that the total error of $m_k(s)$ as an approximation to m(s), summed over all strings s, decreases exponentially with k. The question of this convergence was first broached in [4]. The bound of $1/2^k$ is of theoretical value only; in practice we can find tighter bounds. In fact, the proof counts all $2(4n+2)^{2n}$ programs with n states in order to bound the error, though many of them do not halt. In Section 5.1 we provide a finer error estimate for $m_5$ by removing from the count some very trivial machines that do not halt.
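As an illustration (our arithmetic), for the approximation $m_5$ computed in Section 5 the proposition guarantees

$$\sum_{s \in (0+1)^*} |m(s) - m_5(s)| \le \frac{1}{2^5} = 0.03125,$$

that is, the total probability mass that machines with more than 5 states could redistribute among all strings is at most about 3%.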
4.2 Properties of m and m_k
Levin's distribution is characterized by some important properties. First, it is lower semi-computable, that is, it is possible to compute increasingly good lower bounds for it. Second, it is a semi-measure, because the sum of the probabilities of all strings is smaller than 1. The key property of Levin's distribution is its universality: a semi-measure P is universal if and only if for every other semi-measure P′ there exists a constant c > 0 (which may depend only on P and P′) such that c · P(s) ≥ P′(s) for every string s. That is, a distribution is universal if and only if it dominates (modulo a multiplicative constant) every other semi-measure. In this section we present some results pertaining to the computational properties of m and $m_k$.

Proposition 13 (Runtime bound). Given any binary string s, a machine with k states producing s runs at most $2^{|s|} \cdot |s| \cdot k$ steps before halting, or never halts.

Proof. Suppose that a machine τ produces s. We can trace back the computation of τ upon halting by looking at the portion of |s| cells on the tape that will constitute the output. Before each step, the machine may be in one of k possible states, reading one of the |s| cells. Also, the |s| cells can be filled in $2^{|s|}$ ways (with a 0 or a 1 in each cell). This makes for $2^{|s|} \cdot |s| \cdot k$ different possible instantaneous descriptions of the computation, so any machine halting with output s may run at most that number of steps. Otherwise, it would either repeat an instantaneous description and enter a loop, or visit more than |s| cells and produce a longer string.

Observe that a key property of our output convention is that we use all the cells visited by the head. This is what gives us the runtime bound, which in turn serves to prove the most important property of $m_k$, its computability (Theorem 14).

Theorem 14 (Computability of $m_k$). Given k and s, the value of $m_k(s)$ is computable.

Proof. According to (8) and Proposition 3, there is a finite number of machines involved in the computation of $m_k(s)$. Moreover, Proposition 13 sets the maximum runtime any of these machines may need in order to produce s. So an algorithm to compute $m_k(s)$ enumerates all machines in (n, 2) for 1 ≤ n ≤ k and runs each machine up to the corresponding bound (a sketch in code is given below, after the uncomputability results).

Corollary 15. Given a binary string s, the minimum k with $m_k(s) > 0$ is computable.

Proof. Trivially, s can be produced by a Turing machine with |s| states in just |s| steps: at step i, such a machine writes the i-th symbol of s, moves to the right and changes to a new state, halting once all symbols of s have been written. So, to find the minimum k with $m_k(s) > 0$, we can enumerate all machines in (n, 2), 1 ≤ n ≤ |s|, and run all of them up to the runtime bound given by Proposition 13. The first machine producing s (if machines are enumerated from smaller to larger size) gives the value of k.

Now we give some uncomputability results for $m_k$.

Proposition 16. Given k, the length of the longest s with $m_k(s) > 0$ is not computable.

Proof. We proceed by contradiction. Suppose there were a computable function l(k) giving the length of the longest s with $m_k(s) > 0$. Then l(k), together with the runtime bound in Proposition 13, would provide a computable function bounding the maximum runtime of any halting machine in (k, 2). But this contradicts the uncomputability of the Busy Beaver functions [14]: the highest runtime among halting machines in (k, 2) grows faster than any computable function.

Corollary 17. Given k, the number of different strings s with $m_k(s) > 0$ is not computable.

Proof. Also by contradiction: if the number of different strings with $m_k(s) > 0$ were computable, we could run all machines in (k, 2) in parallel until the corresponding number of different strings had been produced. This would give us the longest such string, contradicting Proposition 16.
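The proof of Theorem 14 is effectively an algorithm. Below is a direct, unoptimized transcription of it (ours; the actual computations in [16] rely on heavy symmetry reductions), reusing the hypothetical decode_machine sketch from Section 2. Because $|(n, 2)| = (4n+2)^{2n}$, it is only feasible for very small k:

    from fractions import Fraction
    from math import ceil, log2

    def run(table, blank, max_steps):
        """Simulate a Busy Beaver machine on a blank tape; return its output or None."""
        tape, pos, state, visited, steps = {}, 0, 1, {0}, 0
        while state != 0:
            if steps == max_steps:
                return None                  # did not halt within the runtime bound
            state, tape[pos], move = table[(state, tape.get(pos, blank))]
            steps += 1
            if state != 0:                   # on halting (d = 0) the head no longer moves
                pos += move
                visited.add(pos)
        return "".join(str(tape.get(c, blank))   # output: exactly the visited cells
                       for c in range(min(visited), max(visited) + 1))

    def m_k(k, s):
        """Finite approximation m_k(s) of Definition 11, as an exact rational."""
        bound = 2 ** len(s) * len(s) * k     # runtime bound of Proposition 13
        total = Fraction(0)
        for n in range(1, k + 1):
            weight = Fraction(1, 2 ** (n + 1 + ceil(2 * n * log2(4 * n + 2))))
            for t in range((4 * n + 2) ** (2 * n)):
                table = decode_machine(n, t)     # enumeration sketch from Section 2
                for blank in (0, 1):
                    if run(table, blank, bound) == s:
                        total += weight
        return total

For instance, m_k(2, "0") performs the 2 · (36 + 10000) = 20072 executions of all one- and two-state machines on both blank symbols and sums the weights of those whose output is the single character 0.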
Now to the key property of m, its computability.

Theorem 18 (Computability of m). Given any non-empty binary string s, the value of m(s) is computable.

Proof. As we argued in the proof of Corollary 15, a non-empty binary string s can be produced by a machine with |s| states; trivially, it is then also produced by machines with more than |s| states. So for every non-empty string s, the value of m(s) is, according to (7), a sum of countably many rationals, which is a real number. A real number is computable if, and only if, there is some algorithm that, given n, returns its first n digits. And this is what $m_k(s)$ provides: Proposition 12 enables us to calculate the value of k such that $m_k(s)$ yields the required digits of m(s), since $m(s) - m_k(s)$ is bounded by $1/2^k$.

The subunitarity of m and $m_k$ means that the sum of m(s) (or $m_k(s)$) over all strings s is smaller than one. This is because of the non-halting machines:

Proposition 19 (Subunitarity). The sum of m(s) over all strings s is smaller than 1, that is,

$$\sum_{s \in (0+1)^*} m(s) < 1$$
Proof. By using (7),

$$\sum_{s \in (0+1)^*} m(s) = \sum_{n=1}^{\infty} \frac{|\{\tau \in (n,2) \mid \tau(0)\downarrow\}| + |\{\tau \in (n,2) \mid \tau(1)\downarrow\}|}{2^{\,n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil}} \qquad (9)$$
but $|\{\tau \in (n,2) \mid \tau(0)\downarrow\}| + |\{\tau \in (n,2) \mid \tau(1)\downarrow\}|$ is the number of machines in (n, 2) that halt when starting on a blank tape filled with 0, plus the number of machines in (n, 2) that halt when starting on a blank tape filled with 1. This number is at most twice the cardinality of (n, 2), and we know it is strictly smaller, as there are very trivial machines that do not halt, such as those with no transition into the halting state. So

$$\sum_{s \in (0+1)^*} m(s) < \sum_{n=1}^{\infty} \frac{2(4n+2)^{2n}}{2^{\,n + 1 + \lceil \log_2((4n+2)^{2n}) \rceil}} \le \sum_{n=1}^{\infty} \frac{1}{2^n} = 1$$