ISIT 1998, Cambridge, MA, USA, August 16 - August 21

Deterministic Computation of Complexity, Information and Entropy

Mark R. Titchener
Dept of Computer Science, The University of Auckland, Auckland, New Zealand. Email: mark@tcode.auckland.ac.nz

Abstract - A new measure of string complexity [3] for finite strings is presented, based on a specific recursive hierarchical string production process (cf. [2]). From the maximal bound we deduce a relationship between complexity and total information content.

Given an alphabet $A$ and a prefix-free code $W \subset A^+$, we define the generalized T-augmentation of $W$ by:

$$ W^{k}_{p} \;=\; \Big( \bigcup_{i=0}^{k} p^{i}\,(W \setminus \{p\}) \Big) \cup \{p^{k+1}\}. \qquad (1) $$

$p \in W$ is referred to as the T-prefix, and $k \in \mathbb{N}$ as the corresponding T-expansion parameter. Applying Eqn (1) recursively, starting with $A$ and subject to the recursive constraints $p_1 \in A$, $p_i \in A^{(k_1,\ldots,k_{i-1})}_{(p_1,\ldots,p_{i-1})}$, $i = 2, \ldots, n$, yields the T-code set $A^{(k_1,\ldots,k_n)}_{(p_1,\ldots,p_n)}$.
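The construction in Eqn (1) is simple to exercise directly. The following is a minimal Python sketch, assuming codewords are represented as strings and a code as a set of strings; the function name t_augment and the worked example are ours, not the paper's.

def t_augment(W, p, k):
    """Generalized T-augmentation, Eqn (1): return the union of the sets
    p^i * (W without p) for i = 0..k, together with the codeword p^(k+1)."""
    assert p in W and k >= 1, "p must be a codeword of W and k a positive integer"
    rest = W - {p}
    result = {p * i + w for i in range(k + 1) for w in rest}
    result.add(p * (k + 1))
    return result

# Starting from the alphabet A = {'0', '1'} and applying Eqn (1) twice:
A = {'0', '1'}
W1 = t_augment(A, '0', 1)     # T-prefix p1 = '0', k1 = 1  ->  {'1', '01', '00'}
W2 = t_augment(W1, '01', 2)   # p2 = '01' is drawn from W1, k2 = 2
print(sorted(W1), sorted(W2))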

We view generalized T-augmentation as a production process for the maximal-length strings in $A^{(k_1,\ldots,k_n)}_{(p_1,\ldots,p_n)}$, each having the form $x = p_n^{k_n} p_{n-1}^{k_{n-1}} \cdots p_1^{k_1} a$, $a \in A$. Conversely, given any $x \in A^+$, it is straightforward to derive the vectors $(p_1,\ldots,p_n)$ and $(k_1,\ldots,k_n)$ respectively.

We define our string complexity, denoted $C_T(x)$, as the effective number of T-augmentation steps required to generate $x$ from $A$. More formally:

$$ C_T(x) \;=\; \sum_{i=1}^{\mathrm{arity}(\mathbf{k})} \log_2(k_i + 1). \qquad (2) $$
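To make Eqn (2) and the production form above concrete, the sketch below takes the vectors $(p_i)$ and $(k_i)$ as given and builds the corresponding maximal-length string; deriving the vectors from an arbitrary $x$ (T-decomposition) is not shown. The names t_complexity and maximal_string, and the example values, are ours.

import math

def t_complexity(ks):
    """Eqn (2): C_T(x) = sum over i of log2(k_i + 1), measured in taugs."""
    return sum(math.log2(k + 1) for k in ks)

def maximal_string(ps, ks, a):
    """Production form: x = p_n^{k_n} p_{n-1}^{k_{n-1}} ... p_1^{k_1} a."""
    return ''.join(p * k for p, k in zip(reversed(ps), reversed(ks))) + a

# Example over A = {'0', '1'}: p1 = '0' with k1 = 2, then p2 = '001' with k2 = 1
# (note that '001' is a codeword of the T-code set produced by the first step).
ps, ks = ['0', '001'], [2, 1]
x = maximal_string(ps, ks, '1')    # '001' + '00' + '1' = '001001'
print(x, t_complexity(ks))         # C_T = log2(3) + log2(2), about 2.58 taugs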

An upper bound for $C_T(x)$ as a function of string length is deduced from understanding the growth in the length of the maximal-length strings with minimal T-augmentation. Let $d \in \mathbb{N}^\infty$, i.e. $d = (d_1, d_2, \ldots, d_i, \ldots)$, denote a distribution vector of unbounded arity whose elements $d_i \in \mathbb{N}$ represent the number of code strings of length $i$ in $A^{(k_1,\ldots,k_n)}_{(p_1,\ldots,p_n)}$. More particularly, let $d^{(l)}$ denote the distribution resulting from exhaustive simple T-augmentation, that is, where all code strings of length less than $l$ are consumed in turn as T-prefixes $p_i$ with corresponding $k_i = 1$. We assume a unit vector $e = (1, 0, 0, \ldots)$ and define a shift operator $\sigma$ such that $d' = \sigma^{(j)} d$ is given by:

$$ d'_i \;=\; \begin{cases} 0 & \text{for } i < j \\ d_{i-j} & \text{for } i \ge j \end{cases} $$

We recursively determine $d^{(l)}$, $l = 1, 2, \ldots$, in terms of $d^{(l-1)}$ and $m_{l-1}$, the value of the left-most non-zero element of $d^{(l-1)}$, which is the number of smallest T-prefixes available of length $l-1$; the recursion starts from

$$ d^{(1)} = (\#A, 0, 0, \ldots). $$
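The shift operator and the extraction of $m_{l-1}$ are simple enough to state in code. The sketch below (our names, with zero-based Python lists standing in for the 1-indexed vectors and $d_0$ taken as 0) covers only these two ingredients, not the full recursion for $d^{(l)}$.

def shift(d, j):
    """Shift operator sigma^(j): the first j positions of the result are 0,
    and position i of the result holds d_{i-j} thereafter."""
    return [0] * j + list(d)

def leftmost_nonzero(d):
    """m: the value of the left-most non-zero element of a distribution vector."""
    return next((v for v in d if v != 0), 0)

# d(1) for a binary alphabet: #A = 2 codewords of length 1.
d1 = [2, 0, 0, 0]
print(shift(d1, 1))          # [0, 2, 0, 0, 0]: the counts move up by one length
print(leftmost_nonzero(d1))  # m = 2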

The right-most non-zero element position in $d^{(l)}$ is the length of the maximal-length strings. Since every step of exhaustive simple T-augmentation has $k_i = 1$ and so contributes $\log_2 2 = 1$ taug, the complexity of the strings of this length is simply $C_T(x) = \sum_i m_i$, and it is found to be very accurately described by $\mathrm{li}(\log_e \#A \cdot |x|)$, where $\mathrm{li}(n) = \int_0^n du/\log(u)$. Conversely, a lower bound for $C_T(x)$ is obtained by observing that for a single repeating symbol, Eqn (2) reduces to $n = 1$, $k_1 = |x| - 1$. Thus:

$$ \log_2(|x|) \;\le\; C_T(x) \;\le\; \mathrm{li}(|x| \log_e \#A). $$
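The two bounds can be evaluated numerically. The sketch below approximates li by simple trapezoidal quadrature with the conventional value li(2) of about 1.04516 as the offset; the paper does not spell out the integration limits, so this convention, and the function names, are our assumptions.

import math

def li(n, steps=20000):
    """Logarithmic integral, approximated as li(2) plus the trapezoidal
    integral of du/ln(u) from 2 to n."""
    if n <= 2:
        return 1.04516
    h = (n - 2) / steps
    total = 0.5 * (1 / math.log(2) + 1 / math.log(n))
    total += sum(1 / math.log(2 + i * h) for i in range(1, steps))
    return 1.04516 + h * total

def complexity_bounds(length, alphabet_size):
    """log2(|x|) <= C_T(x) <= li(|x| * ln(#A)), both in taugs."""
    return math.log2(length), li(length * math.log(alphabet_size))

# Bounds for a 10000-symbol string over a 75-symbol alphabet.
print(complexity_bounds(10000, 75))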

In computing $C_T(x)$ as a function of file length for printed texts, for example, one finds that $C_T(x)$ closely follows $\mathrm{li}(C\,|x|)$, where $C$ is a constant for the source. Recognising that, in the context of the present model, $C$ represents a bound on the expected compression for the source, to be achieved by mapping the source string onto a maximal-complexity string of equivalent complexity, we write $C_T(x) = \mathrm{li}(\hat{E}(x) \times |x|)$, where $\hat{E}$ is the expected entropy for the string $x$ in nats/symbol. Thus we conclude that the complexity of a string (in taugs) is simply the logarithmic integral of the total information $\hat{E}(x) \times |x|$ (in nats/symbol $\times$ symbols $=$ nats).

Given a file, we may easily compute $C_T(x)$ from Eqn (2) and from this $\hat{E}(x) = \mathrm{li}^{-1}(C_T(x))/|x|$. This was done for a number of English texts (alphabet sizes of 75-90 symbols), yielding entropy values ranging from 1.6-1.9 bits/char. This compares well with [1], in which an 'upper bound' of 1.75 bits per character for full text is arrived at by constructing a word trigram model from 583 million words of training text and then estimating the cross-entropy. This correspondence is interpreted as corroborating evidence of the new deterministic theory.
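The entropy estimate $\hat{E}(x) = \mathrm{li}^{-1}(C_T(x))/|x|$ can be obtained by inverting li numerically, for example by bisection, and converting from nats to bits per symbol. The sketch below uses the same quadrature-based li as above; the complexity and length figures in the example are purely illustrative, not the paper's measurements, and all names are ours.

import math

def li(n, steps=20000):
    """Logarithmic integral: li(2) plus the trapezoidal integral of du/ln(u) from 2 to n."""
    if n <= 2:
        return 1.04516
    h = (n - 2) / steps
    total = 0.5 * (1 / math.log(2) + 1 / math.log(n))
    total += sum(1 / math.log(2 + i * h) for i in range(1, steps))
    return 1.04516 + h * total

def li_inverse(y, lo=2.0, hi=1e9):
    """Solve li(t) = y for t by bisection (li is increasing)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if li(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def entropy_bits_per_char(c_t, length):
    """E(x) = li^{-1}(C_T(x)) / |x| in nats/symbol, converted to bits/symbol."""
    return li_inverse(c_t) / length / math.log(2)

# Purely illustrative figures: a 100000-character text measured at 11000 taugs.
print(entropy_bits_per_char(11000.0, 100000))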

REFERENCES

[1] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, J. C. Lai, and R. L. Mercer, "An Estimate of an Upper Bound for the Entropy of English", Computational Linguistics, vol. 18, no. 1, pp. 31-40, 1992.

[2] A. Lempel and J. Ziv, "On the Complexity of Finite Sequences", IEEE Trans. Inform. Theory, vol. 22, no. 1, pp. 75-81, January 1976.

[3] M. R. Titchener, "A Deterministic Theory of Complexity, Information and Entropy", in Recent Results, IEEE Information Theory Workshop ITW-98, San Diego, February 1998.


