Preprint, submitted to Physical Review E
Runoff as Complexity Indicator for Terrestrial Ecosystems

Holger Lange, Frank Wolf and Michael Hauhs
BITÖK, University of Bayreuth, D-95440 Bayreuth, Germany

Terrestrial ecosystems are often considered to be among the structurally most complex systems that can be scientifically investigated. On the other hand, utilization techniques such as forestry or water management, which focus on specific functional aspects, often follow relatively simple empirical rules that have led to satisfactory predictions in the past. This apparent contradiction raises the question of whether the functional simplicity of such systems can be quantified. We use time series analysis and information-theoretic methods to answer this question for an important type of such systems, natural watersheds, and for the most important variable in these systems, water flow. Based on the distinction between randomness and complexity, three different types of data sets can be characterized: very ordered and simple, partly random and rather complex, or very random and again simple. This classification has implications for modeling in hydrology. It also reveals previously unobserved universal features that characterize watersheds over a wide range of scales and climates.

Introduction

Natural (forested) watersheds have played an important role as drinking water and timber resources for centuries. At first sight, they are also well-suited landscape units for experimental investigation [1]. They allow one to measure matter fluxes that have the potential to be closed on the system's scale, to perform tracer experiments, or to investigate in detail the physicochemical or hydraulic properties that are expected to determine the spectrum of residence times and/or flowpaths prevailing in the system [2-4].
However, there appears to be only a rather loose connection between the detailed, spatially resolved information derived from scientific studies of within-watershed processes (with soil water content or chemical composition as prototypical examples) and the highly aggregated functional view that is characteristic of ecosystem management. Runoff from small forested watersheds is one of the few key variables able to bridge these two realms of ecosystem perception. Yet, after decades of research, mechanisms of runoff generation [5] and statistical models of watershed runoff are both still being discussed [6,7]. While ecosystem complexity appears to be a convenient excuse for the difficulties involved in conceptually reconstructing a watershed, it does not explain the simplicity captured by the empirical management models used in valuing and utilizing an ecosystem.
We suggest using quantitative, information-theoretic measures of complexity for key functional variables such as precipitation and runoff, as guidance for a more transparent use of the term complexity in connection with watersheds and for a better choice of hydrological models. Despite the large amount of data from inside watersheds, here defined as ecosystems, model attempts to reconstruct the runoff share the common properties of success on the one hand and arbitrariness on the other [8,9]. Empirical data constrain transport models far too weakly. Part of the problem is the observed heterogeneity within the system, which is not reflected in its output. A massive overparametrization of typical process-based transport models is the usual consequence. Our starting point is the question of how much complexity is contained in the output (runoff) compared to the amount provided by the respective external input (precipitation), and how much should thus be incorporated into a conceptual runoff model [10]. We will show that information-theoretic measures are sensitive instruments for evaluating the quality of runoff reconstructions with statistical models, and that they reveal similar behavior among those systems that are intuitively regarded as ecosystems.

Complexity measures

The quantities calculated fall into one of two classes: the randomness measures, with the Shannon entropy as a prototype, and the complexity measures. Members of the first class are zero for constant sequences and maximal for completely uncorrelated random sequences. Members of the second class have a minimum at both extremes and a maximum somewhere in between [11]. They reflect the fact that a conceptual description of both constant and random sequences is easy, whereas data that are partly random and partly regular are usually regarded as more difficult to characterize, and thus as more complex.
Given a real-valued time series $\{x_t\}$, the first step in applying these measures is to convert it to a discrete symbol sequence $\{s_t\}$. This can be done in many ways. The simplest one, which we adopt here, is to define a threshold value and to assign a 0 to each value below or equal to the threshold and a 1 to each value above it. The resulting binary sequence is the basis of the analysis. Then, one defines a word length $L$ to group symbols together (in this article, $L = 5$). Longer correlations are not quantified by our measures; the range of meaningful $L$ values is limited by the finite length of the time series [12]. The relative word frequencies $p_i^L$ and conditional (or transition) probabilities $p_{ij}^L$ are calculated. These are the ingredients needed to calculate the Rényi entropy [13]
$$H_{\alpha,L} = \frac{1}{1-\alpha}\,\log_2\Big(\sum_i \big(p_i^L\big)^{\alpha}\Big) \qquad (1)$$
a special case of which (in the limit $\alpha \to 1$) is the generalized Shannon entropy

$$H_{1,L} \equiv H_L = -\sum_i p_i^L \log_2 p_i^L \qquad (2)$$
and our preferred measure for randomness, the mean information gain [14]:

$$\mathrm{MIG}_L = H_L - H_{L-1} \qquad (3)$$
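The steps from a real-valued series to the mean information gain can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results reported here: all function names are hypothetical, words are taken as overlapping windows of length L (the text does not specify how words are extracted), and H_0 is set to zero by treating the empty word as certain.

```python
import math
from collections import Counter


def binarize(series, threshold):
    """Assign 0 to values <= threshold and 1 to values above it."""
    return [1 if x > threshold else 0 for x in series]


def word_probabilities(symbols, L):
    """Relative frequencies of words of length L.

    Assumption: words are overlapping windows shifted by one symbol.
    """
    if L == 0:
        return {(): 1.0}  # the empty word occurs with certainty, so H_0 = 0
    words = [tuple(symbols[t:t + L]) for t in range(len(symbols) - L + 1)]
    n = len(words)
    return {w: c / n for w, c in Counter(words).items()}


def renyi_entropy(probs, alpha):
    """Eq. (1) in bits; alpha = 1 is evaluated as the Shannon limit, Eq. (2)."""
    p = [v for v in probs.values() if v > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(v * math.log2(v) for v in p)
    return math.log2(sum(v ** alpha for v in p)) / (1.0 - alpha)


def mean_information_gain(symbols, L):
    """Eq. (3): MIG_L = H_L - H_{L-1}."""
    h_l = renyi_entropy(word_probabilities(symbols, L), 1.0)
    h_lm1 = renyi_entropy(word_probabilities(symbols, L - 1), 1.0)
    return h_l - h_lm1
```

For a constant symbol sequence only one word occurs at every length, so both entropies vanish and the sketch returns a mean information gain of zero, in line with the behavior expected of a randomness measure.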
Our intuitive notion of complexity is quantified by the fluctuation complexity [15], which is given by

$$C_F(L) = \sum_{i,j} p_{ij}^L \left(\log_2 \frac{p_i^L}{p_j^L}\right)^{2} \qquad (4)$$
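Eq. (4) can be estimated from a symbol sequence as sketched below. This is a hedged sketch rather than the authors' code: the function name is hypothetical, words are assumed to be overlapping windows, and the weight of each transition is assumed to be the relative frequency with which word i is immediately followed by word j, so that the weights sum to one.

```python
import math
from collections import Counter


def fluctuation_complexity(symbols, L):
    """Estimate Eq. (4) from a symbol sequence.

    Assumptions: overlapping words of length L; p_ij is taken as the
    relative frequency of word i being immediately followed by word j.
    """
    words = [tuple(symbols[t:t + L]) for t in range(len(symbols) - L + 1)]
    n_words = len(words)
    if n_words < 2:
        raise ValueError("sequence too short for word length L")
    p = {w: c / n_words for w, c in Counter(words).items()}

    # Count successive word pairs (word at t, word at t + 1).
    pairs = Counter(zip(words[:-1], words[1:]))
    n_pairs = n_words - 1

    total = 0.0
    for (wi, wj), count in pairs.items():
        total += (count / n_pairs) * math.log2(p[wi] / p[wj]) ** 2
    return total
```

For a constant sequence every transition connects a word to itself, so the logarithm vanishes and the estimate is zero, matching the requirement that a complexity measure has a minimum for perfectly ordered data.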
Another complexity measure, proposed here, is a quantity that we call the Rényi complexity [12]:

$$C_R(\alpha, L) = \frac{2\,\big(H_{1/\alpha,L} - H_{\alpha,L}\big)}{(\alpha - 1)\,L\,\ln 2} \qquad (5)$$
An important difference between $C_R$ and $C_F$ is that only the latter involves transition probabilities; short-term correlation structure is invisible to $C_R$. However, for the baseline process to which our results will be compared, the binary Bernoulli process (or biased coin toss), the two are connected by
$$C_F(L) = \lim_{\alpha \to 1} C_R(\alpha, L) \quad \text{(for Bernoulli processes)} \qquad (6)$$
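The connection (6) can be checked numerically for a Bernoulli process. The sketch below is illustrative only, and its function names are hypothetical. It assumes that the word probabilities of a Bernoulli process factorize, so that H_{α,L} is L times the single-symbol Rényi entropy, and that words overlap, in which case Eq. (4) reduces to the closed form 2p(1-p)[log2(p/(1-p))]^2, where p is the probability of the symbol 1.

```python
import math


def bernoulli_renyi_entropy(p, alpha, L):
    """Eq. (1) for L-words of a Bernoulli(p) process (additive in L)."""
    if abs(alpha - 1.0) < 1e-12:
        h1 = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    else:
        h1 = math.log2(p ** alpha + (1.0 - p) ** alpha) / (1.0 - alpha)
    return L * h1


def bernoulli_renyi_complexity(p, alpha, L):
    """Eq. (5) for a Bernoulli(p) process; alpha must differ from 1."""
    diff = (bernoulli_renyi_entropy(p, 1.0 / alpha, L)
            - bernoulli_renyi_entropy(p, alpha, L))
    return 2.0 * diff / ((alpha - 1.0) * L * math.log(2))


def bernoulli_fluctuation_complexity(p):
    """Closed form of Eq. (4) for Bernoulli(p), assuming overlapping words."""
    return 2.0 * p * (1.0 - p) * math.log2(p / (1.0 - p)) ** 2


def bernoulli_mig(p):
    """Eq. (3) for Bernoulli(p): the binary Shannon entropy."""
    return bernoulli_renyi_entropy(p, 1.0, 1)


if __name__ == "__main__":
    p, L = 0.3, 5  # hypothetical example values
    for alpha in (1.5, 1.1, 1.01, 1.001):
        print(alpha, bernoulli_renyi_complexity(p, alpha, L))
    print("C_F:", bernoulli_fluctuation_complexity(p))
```

Sweeping p from 0 to 1 and plotting `bernoulli_fluctuation_complexity(p)` against `bernoulli_mig(p)` traces a Bernoulli reference curve in the randomness-complexity plane under the same assumptions.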
Data sets

We have investigated daily values of runoff from 27 different watersheds: 20 that may be considered "natural" forested watersheds, located in Europe and the continental U.S.A.; the Mississippi and the Amazon rivers; two tropical watersheds in Puerto Rico; and three urban watersheds covered to a high degree by human settlements. For 11 of these watersheds, we also analyzed daily values of precipitation.
The data records are between 11 and 98 years long and have essentially no gaps. Taken together, they cover a broad spectrum of sizes (less than 1 to more than $10^6$ km$^2$) and climatic regions. In this presentation, we include all watersheds to which we applied our analysis.

Complexity analysis

We present results for the calculation of $\mathrm{MIG}_5$ and $C_F(5)$ in Fig. 1, using the median of the value distribution as the partitioning parameter. In this representation, the analytical result for the Bernoulli process constitutes a limit curve, i.e. the highest $C_F$ value possible for a given MIG. To investigate heterogeneities within a series, windows of a fixed length (4 years) have been defined and moved across the data set, leading to the standard deviations given in the figure. Surprisingly, almost all watersheds appear to lie on a single one-parametric curve of the form (5), the best fit value for the exponent being $\alpha_{opt} \approx 1.28$. Knowing the MIG of a watershed data set allows one to estimate the corresponding $C_F$ value with good accuracy. Five groups of data sets can be identified from Fig. 1:

1. The Amazon river streamflow is by far the least random data set on short time scales. Within a few days, its dynamics is simple compared to the other data sets (short-term prediction is easy). This is surely related to the fact that the Amazon river basin is the largest worldwide; however, in general there is no correlation between size or integrated runoff volume and MIG values in the data set.
2. The forested watersheds located within temperate climate regions are characterized by medium randomness and high or very high complexity. All data points lie to the left of the maximum of the best fit curve.
3. The urban watersheds have higher randomness and do not fully achieve the complexity predicted by $C_R(\alpha_{opt})$. Excluding these three time series from the analysis raises the explained variance of the fit from 70% to 85%.
4. The two tropical watersheds are more random and less complex than all the others. They receive a huge amount of precipitation (more than 4000 mm per year), and biological modifications (e.g. through transpiration of trees and other plants) are much less important than at the other forested sites.
5. Precipitation signals are characterized by extremely high randomness, although they differ significantly from a white noise process (given by $\mathrm{MIG} = 1$ and $C_F = 0$). At any given time resolution, the information content of precipitation is higher than that of the corresponding runoff.

To evaluate the quality of the fit to a single $C_R$ curve further, we have calculated local Rényi complexity exponents, using the fact that once the values for MIG (3) and $C_F$ (4) of the data set are known, the corresponding $C_R$ and thus $\alpha$ are implicitly given by (6), (1) and (5). Results of these calculations are shown in Fig. 2. For very low ($\mathrm{MIG} \approx 0$) or very high ($\mathrm{MIG} \approx 1$) randomness values, calculation of $\alpha$ is not meaningful, as all $C_R(\alpha, L)$ curves collapse into one. However, Fig. 2 shows that for the central part of Fig. 1, essentially all local $\alpha$ values are, within errors, compatible with a single value. The urban watersheds do not fit into this picture so well, as expected from Fig. 1.

Reconstructing runoff data and mean information gain

It is well known that in many cases, runoff or streamflow value distributions may be described by a lognormal distribution. According to a recent overview [6], typical runoff time series (at least on an annual basis) may be well approximated by autoregressive lognormal processes (AR(1)-LN models) defined by
$$\ln x_t = \mu + \rho_1\,(\ln x_{t-1} - \mu) + \sigma \sqrt{1 - \rho_1^2}\;\varepsilon_t \qquad (7)$$
where $x_t$ is the original series, $\mu$, $\sigma$ and $\rho_1$ are the mean, standard deviation and lag-one correlation coefficient of the logarithmized series (which may be taken from the original series or corrected for biases [6]), and $\varepsilon_t$ is Gaussian noise of zero mean and unit variance. Equation (7) may be generalized to higher orders (AR(p)-LN models with $p \ge 2$). We have used one specific example, the watershed Lange Bramke (Harz, Germany), to investigate the quality of the model (7) and its $p = 2$ version. It turns out that the observed value distribution is indeed reproduced quite well, slightly better for $p = 1$ than for $p = 2$. The assumption of a lognormal distribution seems justified, and (7) is a reconstruction of the original time series that does much more than simply reproduce the mean and standard deviation to which it was fitted. However, if we take the synthetic AR(1)-LN or AR(2)-LN data sets and calculate their MIGs, the situation changes. In Fig. 3, we show MIG values as a function of the partitioning parameter for the original series and the two fitted reconstructions. The values differ drastically, indicating that the short-term correlational or randomness structure of the original series is not reproduced by the model (7). Repeating the construction of the synthetic data set leads to the (small) error bars given in Fig. 1. There is no reason to expect that higher $p$ would provide better reconstructions; as an example, the position of an AR(10)-LN reconstruction is indicated in Fig. 1. The case $p = 0$ (no memory included) is worse than $p = 2$.
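A synthetic AR(1)-LN series following Eq. (7) can be generated as sketched below. This is a minimal sketch with a hypothetical function name and hypothetical parameter values; it does not use parameters fitted to the Lange Bramke data, and starting the recursion from the stationary distribution is an added assumption.

```python
import math
import random


def simulate_ar1_lognormal(n, mu, sigma, rho1, seed=0):
    """Generate n values of an AR(1)-LN process as in Eq. (7).

    mu, sigma and rho1 are the mean, standard deviation and lag-one
    correlation of the logarithmized series (hypothetical inputs here).
    """
    rng = random.Random(seed)
    # Assumption: start from the stationary distribution of ln x_t.
    y = mu + sigma * rng.gauss(0.0, 1.0)
    innovation_scale = sigma * math.sqrt(1.0 - rho1 ** 2)
    series = []
    for _ in range(n):
        y = mu + rho1 * (y - mu) + innovation_scale * rng.gauss(0.0, 1.0)
        series.append(math.exp(y))  # back-transform to the original scale
    return series


if __name__ == "__main__":
    # Hypothetical parameters, roughly ten years of daily values.
    synthetic = simulate_ar1_lognormal(3650, mu=0.0, sigma=1.0, rho1=0.6)
    print(min(synthetic), max(synthetic))
```

The factor sqrt(1 - rho1^2) keeps the variance of the logarithmized series at sigma^2 regardless of rho1, which is why such a series can match the fitted mean and standard deviation while its short-term symbol statistics remain a separate question.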
Conclusions

Complexity analysis provides a sensitive tool to characterize the short-term structure of data sets. The methodology may be used as a stringent test of the quality of model reconstructions of observed series. In our case, we conclude that AR(1)-LN models of watershed runoff fail to reproduce important structure present in the original data. It seems that the complexity/randomness relationship of runoff from ecosystems is given by a simple one-parametric curve. One may speculate that the existence of this unique curve indicates that the data sets are different (spatial) realizations of the same process; the time series share a common property not easily expressible in other terms. The consistently higher information content of precipitation is the reason why forward mapping (input to output) requires only a few empirical parameters, whereas inverse methods attempting to identify the physical character of such parameters suffer from an ill-posed modeling problem. The actual position of a given watershed along the curve of Fig. 1 may be linked to site-specific properties and has implications for its predictability, naturalness, or degree of biological influence. Thus, it has the function of a complexity indicator.
References

1. Likens, G.E. and Bormann, H.F. (1995): Biogeochemistry of a forested ecosystem. Springer, New York.
2. Hultberg, H. and Skeffington, R.A. (eds.) (1998): Experimental Reversal of Acid Rain Effects: The Gårdsjön Roof Project. Wiley, Chichester.
3. Moldan, B. and Cerny, J. (1994): Biogeochemistry of small catchments. Wiley, Chichester.
4. Swank, W.T. (1988): Forest hydrology and ecology at Coweeta. Springer, New York.
5. Rodriguez-Iturbe, I. and Rinaldo, A. (1997): Fractal River Basins. Cambridge University Press, Cambridge.
6. Vogel, R.M., Tsai, Y., and Limbrunner, J.F. (1998): The regional persistence and variability of annual streamflow in the United States. Water Resources Research 34, 3445-3459.
7. Lu, Z.-Q., and Berliner, L.M. (1999): Markov switching time series models with application to a daily runoff series. Water Resources Research 35, 523-534.
8. Beven, K. (1993): Prophecy, reality and uncertainty in distributed hydrological modelling. Advances in Water Resources 16, 41-51.
9. Beven, K. (1996): The limits of splitting: Hydrology. The Science of the Total Environment 183, 89-97.
10. Jakeman, A.J. and Hornberger, G.M. (1993): How Much Complexity is Warranted in a Rainfall-Runoff Model? Water Resources Research 29, 2637-2649.
11. Grassberger, P. (1986): Toward a Quantitative Theory of Self-Generated Complexity. International Journal of Theoretical Physics 25, 907-938.
12. Wolf, F. (1999): Berechnung von Information und Komplexität in Zeitreihen – Analyse des Wasserhaushalts von bewaldeten Einzugsgebieten. Dissertation, Univ. Bayreuth (in German).
13. Rényi, A. (1976): Some fundamental questions of information theory. In: Selected Papers of Alfred Rényi, vol. 2, 526-552. Akadémiai Kiadó, Budapest.
14. Wackerbauer, R., Witt, A., Atmanspacher, H., Kurths, J. and Scheingraber, H. (1994): A Comparative Classification of Complexity Measures. Chaos, Solitons & Fractals 4, 133-173.
15. Bates, J. E. and Shephard, H. K. (1993): Measuring complexity using information fluctuation. Physics Letters A 172, 416-425.
[Figure 1 plot not reproduced. Panel title: Complexity diagram for 27 Catchments. Labeled points: Lange Bramke, Mississippi, Amazon, AR(1)-LN LB, AR(2)-LN LB, AR(3)-LN, AR(10)-LN.]
Figure 1: Complexity diagram for runoff from 27 watersheds and precipitation from 11 watersheds. Binary value partition with the median as threshold. Mean values and standard deviations for windows of 4 years each. The curves correspond to the binary Bernoulli process and the Rényi complexity (eq. (5) in text) fitted to the data. Also included are several reconstruction attempts for the Lange Bramke (LB) data set. Information on the data sets can be obtained elsewhere [12].
[Figure 2 plot not reproduced. Panel title: α_local for Rényi Complexity.]
Figure 2: Local exponents for the Rényi complexity. Error bars result from those in Fig. 1. Only the central part ($0.25 \le \mathrm{MIG} \le 0.7$) is shown. The best fit value for all the data sets is given as a straight line.
Figure 3: Mean information gain as a function of the partition parameter. The original data set is much less random than any AR(p)-LN reconstruction. The median and mean of the original series are indicated, exhibiting the asymmetry of runoff data.