PhysComp96 Full paper Draft, February 23, 1997
Entropy and Compressibility of Symbol Sequences

Werner Ebeling ([email protected])
Thorsten Pöschel ([email protected])
Alexander Neiman ([email protected])

Institute of Physics, Humboldt University, Invalidenstr. 110, D-10115 Berlin, Germany

The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Besides the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim here is related to the problem of sequence compression. In spite of the great progress in this field achieved in the work of Shannon, Fano, Huffman, Lempel, Ziv and others [1], many questions still remain open. In particular, one must note that since the basic work by Lempel and Ziv the improvement of the standard compression algorithms has been rather slow, not exceeding a few percent per decade. On the other hand, several experts have expressed the idea that the long-range correlations, which clearly exist in texts, computer programs, etc., are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in compressibility is twofold: (i) We would like to explore how far compressibility is able to measure correlations. In particular we apply the standard algorithms to model sequences with known correlations such as, for instance, repeat sequences and symbolic sequences generated by maps. (ii) We aim to detect correlations which are not yet exploited in the standard compression algorithms and which therefore belong to potential reservoirs for compression algorithms. First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are derived, which apply in some approximation to DNA sequences too. For symbolic strings obtained from special nonlinear maps and for several long texts a characteristic root law for the entropy scaling is detected. Then the compressibilities are estimated by using grammar representations and several standard computing algorithms; in particular we use the Lempel–Ziv compression algorithm. Further, the mean square fluctuations of the composition with respect to the letter content and several characteristic scaling exponents are calculated. We show finally that all these measures are able to detect long-range correlations. However, as demonstrated by shuffling experiments, different measuring methods operate on different length scales. The algorithms based on entropy or compressibility operate mainly on the word and sentence level. The characteristic scaling exponents reflect mainly the long-wave fluctuations of the composition, which may comprise a few hundreds or thousands of letters.
1 Correlation measures and investigated sequences

Characteristic quantities which measure long correlations are dynamic entropies [2, 3, 4, 5, 6], correlation functions and mean square deviations, 1/f^δ noise [7, 8], scaling exponents [9, 10], higher order cumulants [11] and mutual information [12, 13]. Our working hypothesis, which we formulated in earlier papers [3, 8], is that texts and DNA show some structural analogies to strings generated by nonlinear processes at bifurcation points. This is demonstrated here first by the analysis of the behaviour of the higher order entropies. Further analysis is based on the mapping of the text to random walk and "fluctuating gas" models as well as on the spectral analysis. These methods have found several applications to DNA sequences [9, 14, 12] and to human writings [11, 15].

At first we study model sequences with simple construction rules (nonlinear maps, e.g. the logistic map and the circle map, stochastic sequences containing repeating blocks, etc.). A repeat sequence is defined as a random (Bernoulli-type) sequence on an alphabet A_1 . . . A_λ into which a given word of length s, written on the same alphabet, is introduced ν times (ν being large). This simple type of sequence, which most easily admits analytical calculations, was introduced by Herzel et al. [13] for modelling the structure of DNA strings. Repeat sequences may be generated in the following way. First we define the repeat by a string of length s,

R = A_1 . . . A_s ,

i.e. len(R) = s. Then we generate a Bernoulli sequence of length L_0 with N_0 = L_0 letters on the extended alphabet R, A_1 . . . A_λ. This sequence consists of ν repeats and N_0 − ν "random" letters. Finally we replace the letters R by the defining string. In this way a string with N letters is obtained which is like a sea of random letters with interspersed "ordered repeats". This string contains sν repeat letters. Going along the string, each time we meet a repeat, we first have to identify it. Let us assume we need the first s_c ≪ s letters for the identification. After any identification of a repeat we know where we are in the string and how to continue, i.e. the uncertainty is decreased. The repeat sequences defined in this way have well defined correlations with a range of s. Beyond the distance s the correlations are destroyed by the stochastic symbols interspersed between the repeats. This procedure may be continued in a hierarchical way in order to generate longer correlations and hierarchical structures with some similarity to texts. In the special case that the sequence consists of only one kind of repeat we get periodic sequences with the period s; in this case the range of correlations is infinite. Sequences with slowly decaying long correlations are obtained from the logistic map and the circle map at critical points [3, 8].
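This construction may be sketched in a few lines of Python (a minimal sketch with our own naming; for simplicity we place exactly ν repeat markers at random positions of the Bernoulli string, which approximates the construction described above):

```python
import random

def repeat_sequence(lam=4, s=10, nu=100, n0=10000, seed=1):
    """Generate a repeat sequence: a Bernoulli string over `lam` symbols
    into which `nu` copies of one fixed repeat word of length `s` are
    interspersed. Total length is N = (n0 - nu) + nu*s letters."""
    rng = random.Random(seed)
    alphabet = [chr(ord('a') + i) for i in range(lam)]
    repeat = ''.join(rng.choice(alphabet) for _ in range(s))  # the word R
    # Bernoulli sequence of length n0 over the extended alphabet {R, A1..Alam}
    seq = [rng.choice(alphabet) for _ in range(n0)]
    for pos in rng.sample(range(n0), nu):
        seq[pos] = repeat          # replace each marker by the defining string
    return ''.join(seq)

print(repeat_sequence()[:80])
```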
Further we study long standard information carriers, e.g. books and DNA strings, and compare them with the model sequences. In particular we studied the book "Moby Dick" by Melville (L ≈ 1,170,200 letters), the German edition of the Bible (L ≈ 4,423,030 letters), Grimm's Tales (L ≈ 1,435,820 letters), the "Brothers Karamazov" by Dostoevsky (L ≈ 1,896,000 letters), the DNA sequence of the lambda-virus (L ≈ 50,000 letters) and that of yeast (L ≈ 300,000 letters). In order to find out the origin of the long-range correlations we studied the effects of shuffling long texts on different levels (letters, words, sentences, pages and chapters). The shuffled texts were always compared with the original one (without any shuffling). Of course the original and the shuffled files have the same letter distribution; however, only the correlations on scales below the shuffling level are conserved. The correlations (fluctuations) on higher levels, which are based on the large-scale structure of texts as well as on semantic relations, are destroyed by shuffling.
2 Entropy and complexity of sequences

Let A_1 A_2 . . . A_n be the letters of a given substring of length n ≤ L. Let further p^(n)(A_1 . . . A_n) be the probability to find in the string a block with the letters A_1 . . . A_n. Then we may introduce the entropy per block of length n:

H_n = − Σ p^(n)(A_1 . . . A_n) log p^(n)(A_1 . . . A_n) ,   (1)

h_n = H_{n+1} − H_n ,   (2)

where the summation runs over all possible words of length n, i.e. over all words which could be found if the text had infinite length.
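For finite strings the block entropies may be estimated directly from the observed n-word frequencies; a minimal Python sketch (the function names are ours, and for larger n the estimates suffer from strong finite-sample effects, cf. [16]):

```python
import math
from collections import Counter

def block_entropy(text, n, base=32):
    """Estimate H_n from the empirical n-word distribution (eq. 1)."""
    words = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(words.values())
    return -sum((c / total) * math.log(c / total, base)
                for c in words.values())

def conditional_entropy(text, n, base=32):
    """h_n = H_{n+1} - H_n (eq. 2)."""
    return block_entropy(text, n + 1, base) - block_entropy(text, n, base)
```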
For the case of repeat sequences we can carry out analytical calculations for the entropies [13]. For the periodic case N = νs the higher order entropies H_n are constant and the uncertainties h_n are zero:

h_n = 0 ;   H_n = H_s   (3)
if n ≥ s. The lower order entropies for n ≤ s depend on the concrete structure of the individual repeat and can easily be calculated by simple counting. For the case of proper repeat sequences N ≥ sν approximate formulae for the entropy are available [13]. We find, for example (in log λ units),

h_n = 1 − (s − s_c) ν / N ,   (4)

H_n = H_s + (n − s) [1 − (s − s_c) ν / N] ,   (5)

for n ≥ s.
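For illustration: a repeat of length s = 10 which is identified after s_c = 2 letters and occurs ν = 100 times in a string of N = 10^4 letters reduces the uncertainty per letter to h_n = 1 − 8 · 100/10^4 = 0.92 (in log λ units), according to eq. (4).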
Our methods for the analysis of the entropy of natural sequences were explained in detail elsewhere [16]. We have shown that, at least in a reasonable approximation, the scaling of the entropy with the word length is given by a root law. Our best fit of the data obtained for texts on the 32-letter alphabet (measured in log 32 units) reads [17]

H_n ≈ 0.5 · √n + 0.05 · n + 1.7 ,   (6)

h_n ≈ 0.25 · n^{−1/2} + 0.05 .   (7)
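A fit of this form is an ordinary linear least-squares problem in the basis (√n, n, 1); a short sketch, assuming estimated values H_n (e.g. from the block-entropy code above):

```python
import numpy as np

def fit_root_law(ns, Hs):
    """Least-squares fit H_n ~ a*sqrt(n) + b*n + c (eq. 6)."""
    A = np.column_stack([np.sqrt(ns), ns, np.ones_like(ns, dtype=float)])
    coeffs, *_ = np.linalg.lstsq(A, Hs, rcond=None)
    return coeffs  # (a, b, c); approx. (0.5, 0.05, 1.7) for texts [17]

# self-check with synthetic data following eq. (6)
ns = np.arange(1.0, 27.0)
Hs = 0.5 * np.sqrt(ns) + 0.05 * ns + 1.7
print(fit_root_law(ns, Hs))
```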
The dominating term is given by a root law corresponding to a rather long memory tail. We mention that a scaling law of the root type was first found by Hilberg, who made a new fit to Shannon's original data [2]. We used our own data for n = 1, . . . , 26 but included Shannon's result for n = 100 as well. The idea of algorithmic entropy goes back to Kolmogorov and Chaitin and is based on the idea that sequences have a "shortest representation". The relation between the Shannon entropy and the algorithmic entropy is given by the theorems of Zvonkin and Levin. Several concrete procedures to construct a shorter representation are used in data compression algorithms [1]. One may assume that those algorithms are still far from being optimal, since as a rule they do not take into account
long correlations. A rather simple algorithm for finding compressed representations, which was proposed by Thiele and Scheidereiter, is called the grammar complexity [6]. Let us consider a word p on a certain alphabet and let K(p) be the length of the shortest representation of the given string. Then we define the compressibility with respect to a given algorithm by

c(p) = K(p) / K_0(p) ,   (8)

where K_0(p) is the length of the representation of the corresponding Bernoulli string on the same alphabet. For repeat sequences of length N we may find c(p) by using grammar representations [6]. We get the following compressed representation:

S → A_1 . . . A_k R A_1 . . . A_s R A_1 . . . A_n R . . . ,   R → A_1 . . . A_s .
From this representation we calculate the "grammar compressibility":

K(p) ≤ (N − ν(s − 1)) log(λ + 1) + s log λ ,   (9)

c(p) = 1 + [log(λ + 1) (s − ρ(s − 1))] / [N log λ] .   (10)
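Numerically, the bound (9), normalized by K_0(p) = N log λ for the corresponding Bernoulli string, gives the compressibility directly; a small sketch (our naming):

```python
import math

def grammar_compressibility(N, s, nu, lam):
    """Upper bound on c(p) = K(p)/K0(p) for a repeat sequence, eq. (9),
    normalized by K0(p) = N*log(lam) for the Bernoulli string."""
    K = (N - nu * (s - 1)) * math.log(lam + 1) + s * math.log(lam)
    return K / (N * math.log(lam))

# a repeat of length s = 10 occurring nu = 300 times
# in a string of N = 10^4 letters over lam = 4 symbols
print(grammar_compressibility(10_000, 10, 300, 4))   # approx. 0.85 < 1
```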
The algorithmic entropy according to Lempel and Ziv is introduced as the ratio of the length of the compressed sequence (with respect to a Lempel–Ziv compression algorithm) to the original length. Explicit results obtained for the Lempel–Ziv complexities (entropies) of several sequences were given in earlier work [17] (tab. 1). The complexities in dependence on the shuffling level are graphically represented for the text "Moby Dick" in fig. 1.

Figure 1: Lempel–Ziv complexities (dashed line) and scaling exponents of diffusion (full line) in dependence on the shuffling level.
Table 1: The Lempel–Ziv complexities for several original and for the corresponding shuffled sequences.
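As a practical stand-in for a Lempel–Ziv coder one may use any off-the-shelf LZ-type compressor; the sketch below uses Python's zlib (DEFLATE, i.e. LZ77 plus Huffman coding), which yields a rough upper-bound proxy rather than the exact values of tab. 1:

```python
import random
import zlib

def lz_compressibility(text):
    """Ratio of compressed to original length: a proxy for the
    Lempel-Ziv compressibility discussed in the text."""
    data = text.encode('latin-1', errors='replace')
    return len(zlib.compress(data, 9)) / len(data)

def shuffle_letters(text, seed=1):
    """Shuffling on the letter level destroys all correlations;
    the shuffled text should compress less well than the original."""
    letters = list(text)
    random.Random(seed).shuffle(letters)
    return ''.join(letters)
```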
3 Mean square fluctuations and correlation functions

In this part we follow the methods proposed by Peng et al. [9, 10] and the invariant representation proposed by Voss [7]. However, we shall use another language: instead of formulating the problem in terms of random walks we express it in terms of the fluctuation theory of statistical physics. Let us consider a sequence with the total length L. Then the total number of letters is N = L and the density is equal to 1. However, the number density of the different symbols may fluctuate along the string. In an earlier work [11] we considered for example the fluctuating local density of blanks in "Moby Dick" and pointed out the existence of rather long-wave fluctuations. We represented there the local frequency of the blanks (and other letters), averaged over windows of length 4,000, in dependence on the position along the text. The original text shows a large-scale structure extending over many windows. This reflects the fact that in some parts of the text we have many short words, e.g. in conversations (yielding the peaks of the space frequency), and in others we have more long words, e.g. in descriptions and in philosophical considerations (yielding the minima of the space frequency). The shuffled text shows a much weaker non-uniformity: the lower the shuffling level, the larger the uniformity. More uniformity means fewer fluctuations and more similarity to a Bernoulli sequence. For the case of DNA sequences no analogues of pages, chapters, etc. are known. Nevertheless the reaction to shuffling is similar to that of texts.

In order to quantify these findings let us define the number of letters of kind k inside a substring of length l by N(k, l). In the limit l → ∞ we get the average density n(k) = lim_{l→∞} N(k, l)/l. Since we have λ different symbols we get in this way a λ-dimensional composition space. Let us now consider the fluctuations of N(k, l) as a function of l. We expect that N(k, l) fluctuates around the mean value ⟨N(k)⟩ = n(k) l. Further we assume that the mean square fluctuations scale with a certain power of the mean (particle) numbers:
⟨[N(k, l) − ⟨N(k)⟩]²⟩ = const · ⟨N(k)⟩^{α_k} .   (11)
The exponent α_k is called the characteristic mean square fluctuation exponent. In an analogous way we consider the sum of the mean square fluctuations, defining an exponent α by

Σ_k ⟨[N(k, l) − ⟨N(k)⟩]²⟩ = const · N^α .   (12)
The case α(k) = 0.5 corresponds to the normal behaviour of mean square fluctuations which statistical mechanics predicts in the absence of long-range correlations. If α(k) > 0.5 we have an anomalous fluctuation behaviour which reflects the existence of long-range correlations; in this respect we may use the term "coherent fluctuations" [17]. One can easily see that the above definitions give the same numbers for the α-coefficients as the mapping to a random walk in a λ-dimensional space [11]. Let us consider this procedure in brief. Instead of the original string consisting of λ different symbols we generate λ strings on the binary alphabet (0, 1) (λ = 32 for texts). In the first string we place a "1" on all positions where there is an "a" in the original string and a "0" on all other positions. The same procedure is carried out for the remaining symbols too. Then we generate random processes corresponding to these strings, moving one step upwards for any "1" and remaining on the same level for any "0". The resulting move over a distance l is called y(k, l), where k denotes the symbol. Then, by defining a λ-dimensional vector space and considering y(k, l) as the component k of the state vector at the (discrete) "time" l, we can map the text to a trajectory. The corresponding procedure is carried out e.g. for DNA sequences, which are mapped to a random walk on a 4-dimensional discrete space. Let us study now the anomalous diffusion exponents [10]. The mean-square displacement for symbol k is determined by

F²(k, l) = ⟨y²(k, l)⟩ − ⟨y(k, l)⟩² ,   (13)
where the brackets ⟨·⟩ mean the averaging over all initial positions. The behaviour of F²(k, l) for l ≫ 1 is the focus of interest. It is expected that F(k, l) follows a power law [10],

F(k, l) ∝ l^{α(k)} ,   (14)

where α(k) is the diffusion exponent for symbol k. We note that the diffusion exponent is related to the exponent of the power spectrum [10, 11]. Besides the individual diffusion exponents for the letters we get an averaged diffusion exponent α for the state space. The comparison of the formulae given above shows the complete equivalence of the fluctuation picture and the random walk picture.

We have investigated the characteristic exponents for several typical symbol sequences which were described above [17]. The data are summarized in tab. 1. In the same way we can obtain other important statistical quantities, such as higher-order moments and cumulants of y(k, l) (see [11]). By calculating the Hölder exponents D_q up to q = 6 we have shown that the higher order moments have (within the limits of accuracy) the same scaling behaviour as the second moment. We repeated the procedure described above for the shuffled files. The results of the calculations in dependence on the shuffling level are also shown in tab. 1. A graphical representation of the results for Moby Dick is given in fig. 1.

Figure 2: Double-logarithmic plot of the power spectrum for "Moby Dick". The full line corresponds to the original text, the dashed line to the text shuffled on the chapter level and the dotted line to the text shuffled on the page level. Shuffling on a level below pages destroys the low-frequency branch.
We see from the numbers in tab. 1 and from fig. 2 that the original texts and DNA sequences show strong long-range correlations, i.e. the coefficients of anomalous diffusion are clearly different from 1/2. After shuffling below the page level the sequences become practically Bernoullian in comparison with the original ones, since the diffusion exponents decrease to a value of about 1/2. The decrease occurs in the shuffling regime between the page level and the chapter level. For DNA sequences the characteristic level of shuffling where the diffusion exponent goes to 1/2 is about 500–1000 symbols. This demonstrates that shuffling on the level of symbols, words, sentences or pages, or of segments of length 500–1000 in the DNA case, destroys the long-range correlations which are felt by the mean square deviations.
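The mapping to binary walks and the estimation of α(k) may be sketched as follows (our implementation; the exponent is read from a log–log fit of F(k, l) over a few window lengths l):

```python
import numpy as np

def diffusion_exponent(text, symbol, lengths=(10, 30, 100, 300, 1000)):
    """Map `text` to the binary walk y(k, l) for one symbol k and
    estimate alpha(k) from the power law F(k, l) ~ l**alpha (eq. 14).
    Assumes the symbol occurs in the text and len(text) > max(lengths)."""
    steps = np.fromiter((c == symbol for c in text), dtype=float)
    y = np.cumsum(steps)                 # walk position y(k, l)
    F = []
    for l in lengths:
        d = y[l:] - y[:-l]               # displacements over all start positions
        F.append(np.sqrt(d.var()))       # eq. (13): F^2 = <y^2> - <y>^2
    alpha, _ = np.polyfit(np.log(lengths), np.log(F), 1)
    return alpha
```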
4 Conclusions
Our results show that the dynamic entropies, the compressibilities and the scaling of the mean square deviations are appropriate measures for the long-range correlations in symbolic sequences. However, as demonstrated by shuffling experiments, different measures operate on different length scales. The longest correlations found in our analysis comprise a few hundreds or thousands of letters and may be understood as long-wave fluctuations of the composition. These correlations (fluctuations) give rise to the anomalous diffusion and to the coherent fluctuations. There is some evidence that these correlations are based on the hierarchical organization of the sequences and on the structural relations between the levels. In other words, these correlations are connected with the grouping of the sentences into hierarchical structures such as paragraphs, pages, chapters etc. Usually, inside a certain substructure, the text shows a greater uniformity on the letter level. Possibly a more careful comparison of the correlations in texts and in DNA sequences may contribute to a better understanding of the informational structure of DNA, in particular of its modular structure. Our results clearly demonstrate that the longest-range correlations in information carriers are of structural origin. The entropy-like measures studied in section 2 operate on the sentence and the word level. In some sense entropies are the most complete quantitative measures of correlation relations; this is due to the fact that the entropies include many-point correlations. On the other hand the calculation of the higher order entropies is extremely difficult, and at the present moment there is no hope to extend the entropy analysis to the level of thousands of letters. Hopefully the analysis of entropies, compressibilities and scaling exponents can be developed into useful instruments for studies of the large-scale structure of information-carrying sequences and may finally contribute to finding improved compression algorithms.
References

[1] Storer, James A., Data Compression: Methods and Theory, Computer Science Press (1988).

[2] Hilberg, W., Frequenz 44 (1990), 243–???; 45 (1991), 1–???.

[3] Ebeling, Werner, and Gregoire Nicolis, Chaos, Solitons & Fractals 2 (1992), 635–???.

[4] Ebeling, Werner, and Thorsten Pöschel, "Entropy and long-range correlations in literary English", Europhys. Lett. 26 (1994), 241–246.

[5] Ebeling, Werner, Thorsten Pöschel, and Karl Friedrich Albrecht, "Transinformation and word distribution of information-carrying sequences", Int. J. Bifurcation & Chaos 5 (1995), 51–61.

[6] Ebeling, Werner, and Miguel Angel Jiménez-Montaño, Math. Biosci. 52 (1980), 53–???.

[7] Voss, Richard F., Phys. Rev. Lett. 68 (1992), 3805–???; Voss, Richard F., Fractals 2 (1994), 1–???.

[8] Anishchenko, Vadim S., Werner Ebeling, and Alexander B. Neiman, Chaos, Solitons & Fractals 4 (1994), 69–???.

[9] Peng, C.-K., et al., Nature 356 (1992), 168–???.

[10] Stanley, H. Eugene, et al., Physica A 205 (1994), 214–???.

[11] Ebeling, Werner, and Alexander Neiman, Physica A 215 (1995), 233–???.

[12] Li, Wentian, and Kunihiko Kaneko, Europhys. Lett. 17 (1992), 655–???.

[13] Herzel, Hans-Peter, Werner Ebeling, and Armin O. Schmitt, "Entropies of biosequences: the role of repeats", Phys. Rev. E 50 (1994), 5061–???.

[14] Peng, C.-K., et al., Phys. Rev. E 49 (1994), 1685–???.

[15] Schenkel, A., Zhang, J., and Zhang, Y.-C., Fractals 1 (1993), 47–???.

[16] Pöschel, Thorsten, Werner Ebeling, and Helge Rosé, "Guessing probability distributions from small samples", J. Stat. Phys. 80 (1995), 1443–1452.

[17] Ebeling, Werner, Alexander Neiman, and Thorsten Pöschel, "Dynamic entropies, long-range correlations and fluctuations in complex linear structures". In: Suzuki, M., and Kawashima, N. (eds.), Coherent Approaches to Fluctuations, World Scientific (1995), ???–???.