Rank dynamics of word usage at multiple scales

2 downloads 0 Views 4MB Size Report
Feb 20, 2018 - As Figure 2 shows, rank diversity d(k) tends to grow with the scale N since, as N increases, it is less ... It is interesting to note that for Romance languages (Spanish, French, and Italian), σ increases when ...... amour–de–dieu.
Rank Dynamics of Word Usage at Multiple Scales 2 , Fernanda ´ Jose´ A. Morales 1,2 , Ewan Colman 3,4 , Sergio Sanchez 1 , Carlos Pineda 2,5 , Gerardo Iniguez 6,7 , Germinal Cocho 2 , ´ ˜ Sanchez-Puig

Jorge Flores 2 , and Carlos Gershenson 3,4,8,9,∗

arXiv:1802.07258v1 [physics.soc-ph] 20 Feb 2018

1 Facultad

´ ´ ´ de Ciencias, Universidad Nacional Autonoma de Mexico, 01000 Mexico D.F., Mexico 2 Instituto de F´ısica, Universidad Nacional Autonoma ´ ´ ´ de Mexico, 01000 Mexico D.F., Mexico 3 Instituto de Investigaciones en Matematicas ´ Aplicadas y en Sistemas, Universidad ´ ´ ´ Nacional Autonoma de Mexico, 01000 Mexico D.F., Mexico 4 Centro de Ciencias de la Complejidad, Universidad Nacional Autonoma ´ ´ de Mexico, ´ 01000 Mexico D.F., Mexico 5 Faculty of Physics, University of Vienna, 1090 Wien, Austria 6 Next Games, 00530 Helsinki, Finland 7 Department of Computer Science, Aalto University School of Science, 00076 Aalto, Finland 8 SENSEable City Lab, Massachusetts Institute of Technology, 02139 Cambridge, MA, USA 9 ITMO University, 199034 St. Petersburg, Russian Federation Correspondence*: Carlos Gershenson [email protected]

ABSTRACT The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books N -grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of N -grams in a given rank, the probability that an N -gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that N -gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations. Keywords: Culturomics, N-grams, language evolution, rank diversity, complexity

1

Morales et al.

1

Rank Dynamics of Word Usage

INTRODUCTION

The recent availability of large datasets on language, music, and other cultural constructs has allowed the study of human culture at a level never possible before, opening the data-driven field of culturomics (Lieberman et al., 2007; Michel et al., 2011; Dodds et al., 2011; Serr`a et al., 2012; Blumm et al., 2012; Sol´e et al., 2013; Tadi´c et al., 2013; Gerlach and Altmann, 2013; Perc, 2013; Febres et al., 2015; Wagner et al., 2014; Pi˜na-Garcia et al., 2016; Pi˜na-Garc´ıa et al., 2018). In the social sciences and humanities, lack of data has traditionally made it difficult or even impossible to contrast and falsify theories of social behaviour and cultural evolution. Fortunately, digitalized data and computational algorithms allow us to tackle these problems with a stronger statistical basis (Wilkens, 2015). In particular, the Google Books N -grams dataset (Michel et al., 2011; Wijaya and Yeniterzi, 2011; Petersen et al., 2012b,a; Perc, 2012; Acerbi et al., 2013; Ghanbarnejad et al., 2014; Dodds et al., 2015; Gerlach et al., 2016) continues to be a fertile source of analysis in culturomics, since it contains an estimated 4% of all books printed throughout the world until 2009. From the 2012 update of this public dataset, we measure frequencies per year of words (1-grams), pairs of words (2-grams), up until N -grams with N = 5 for several languages, and focus on how scale (as measured by N ) determines the statistical and temporal characteristics of language structure. We have previously studied the temporal evolution of word usage (1-grams) for six Indo-European languages: English, Spanish, French, Russian, German, and Italian, between 1800 and 2009 (Cocho et al., 2015). We first analysed the language rank distribution (Zipf, 1932; Newman, 2005; Baek et al., 2011; Corominas-Murtra et al., 2011), i.e. the set of all words ordered according to their usage frequency. By making fits of this rank distribution with several models, we noticed that no single functional shape fits all languages well. Yet, we also found regularities on how ranks of words change in time: Every year, the most frequent word in English (rank 1) is ‘the’, while the second most frequent word (rank 2) is ‘of ’. However, as the rank k increases, the number of words occupying the k-th place of usage (at some point in time) also increases. Intriguingly, we observe the same generic behaviour in the temporal evolution of performance rankings in some sports and games (Morales et al., 2016). To characterize this generic feature of rank dynamics, we have proposed the rank diversity d(k) as the number of words occupying a given rank k across all times, divided by the number T of time intervals considered (for Cocho et al. (2015), T = 210 intervals of one year). For example, in English d(1) = 1/210, as there is only one word (‘the’) occupying k = 1 every year. The rank diversity increases with k, reaching a maximum d(k) = 1 when there is a different word at rank k each year. The rank diversity curves of all six languages studied can be well approximated by a sigmoid curve, suggesting that d(k) may reflect generic properties of language evolution, irrespective of differences in grammatical structure and cultural features of language use. Moreover, we have found rank diversity useful to estimate the size of the core of a language, i.e. the minimum set of words necessary to speak and understand a tongue (Cocho et al., 2015). In this work, we extend our previous analysis of rank dynamics to N -grams with N = 1, 2, ...5 between 1855 and 2009 (T = 155) for the same six languages, considering the first 11, 140 ranks in all 30 datasets (to have equal size and avoid potential finite-size effects). In the next section, we present results for the rank diversity of N -grams. We then compare empirical digram data with a null expectation for 2-grams that are randomly generated from the monogram frequency distribution. Results for novel measures of change probability, rank entropy, and rank complexity follow. Next, we discuss the implications of our results, from practical applications in text prediction algorithms, to the emergence of generic, large-scale features of language use despite the linguistic and cultural differences involved. Details of the methods used close the paper. This is a provisional file, not the final typeset article

2

Morales et al.

2

Rank Dynamics of Word Usage

RESULTS

2.1

Rank Diversity of N -gram usage

Figure 1 shows the rank trajectories across time for selected N -grams in French, classified by value of N and their rank of usage in the first year of measurement (1855). The behaviour of these curves is similar for all languages: N -grams in low ranks (most frequently used) change their position less than N -grams in higher ranks, yielding a sigmoid rank diversity d(k) (Figure 2). Moreover, as N grows, the rank diversity tends to be larger, implying a larger variability in the use of particular phrases relative to words. To better grasp how N -gram usage varies in time, Tables S1-S30 in the Supplementary Information list the top N -grams in several years for all languages. We observe that the lowest ranked N -grams (most frequent) tend to be or contain function words (articles, prepositions, conjunctions), since their use is largely independent of the text topic. On the other hand, content words (nouns, verbs, adjectives, adverbs) are contextual, so their usage frequency varies widely across time and texts. Thus, we find it reasonable that top N -grams vary more in time for larger N . As Figure 2 shows, rank diversity d(k) tends to grow with the scale N since, as N increases, it is less probable to find N -grams with only function words (especially in Russian, which has no articles). For N = 1, 2 in some languages, function words dominate the top ranks, decreasing their diversity, while the most popular content words (1-grams) change rank widely across centuries. Thus, we expect the most 1 frequent 5-grams to change relatively more in time (for example, in Spanish, d(1) is 155 for 1-grams and 7 15 37 2-grams, 155 for 3-grams, 155 for 4-grams, and finally 155 for 5-grams). Overall, we observe that all rank diversity curves can be well fitted by the sigmoid curve 1 Φµ,σ (log10 k) = √ σ 2π

Z

log10 k

e



(y−µ)2 2σ 2

dy,

(1)

−∞

where µ is the mean and σ the standard deviation of the sigmoid, both dependent on language and N value (Table 1). In Figure 3 we see the fitted values of µ and σ for all datasets considered. In all cases µ decreases with N , while in most cases σ increases with N , roughly implying an inversely proportional relation between µ and σ. It is interesting to note that for Romance languages (Spanish, French, and Italian), σ increases when moving from 3-grams to 5-grams, while for Germanic languages (English and German) and Russian (a Slavic language), there is a decrease in σ from N = 3 to N = 4. 2.2

Null model: random shuffling of monograms

In order to understand the dependence of language use — as measured by d(k) — on scale (N ), we can ask whether the statistical properties of N -grams can be deduced exclusively from those of monograms, or if the use of higher-order N -grams reflects features of grammatical structure and cultural evolution that are not captured by word usage frequencies alone. To approach this question, we consider a null model of language in which grammatical structure does not influence the order of words. We base our model on the idea of shuffling 1-gram usage data to eliminate the grammatical structure of the language, while preserving the frequency of individual words (more details in Methods, Section 4.2). 2.2.1

Rank diversity in null model

As can be seen in Figure 4, the rank diversity of digrams constructed from shuffled monograms is generally lower than for the non-shuffled digrams, although it keeps the same functional shape of Frontiers

3

Morales et al.

Rank Dynamics of Word Usage

Equation (1) (see fit parameters in Table 1). In the absence of grammatical structure, the frequency of each 2-gram is determined by the frequencies of its two constituent 1-grams. Thus, combinations of high frequency 1-grams dominate the low ranks, including some that are not grammatically valid — e.g. ’the the’, ’the of’, ’of of’ — but are much more likely to occur than most others. Moreover, the rank diversity of such combinations is lower than we see in the non-shuffled data because the low ranked 1-grams that create these combinations are relatively stable over time. Thus, we can conclude that the statistics of higher order N -grams is determined by more than word statistics, i.e. language structure matters at different scales. 2.2.2 z-scores in null model The amount of structure each language exhibits can be quantified by the z-scores of the empirical 2-grams with respect to the shuffled data. Following its standard definition, the z-score of a 2-gram is a measure of the deviation between its observed frequency in empirical data and the frequency we expect to see in a shuffled dataset, normalized by the standard deviation seen if we were to shuffle the data and measure the frequency of the 2-gram many times (see Section 4.2 for details). The 2-grams with the highest z-scores are those for which usage of the 2-gram accounts for a large proportion of the usage of each of its two constituent words. That is, both words are more likely to appear together than they are in other contexts (for example, ‘led zeppelin’ in the Spanish datasets), suggesting that the combination of words may form a linguistic token that is used in a similar way to an individual word. We observe that the majority of 2-grams have positive z-scores, which simply reflects the existence of non-random structure in language (Figure 5). What is more remarkable is that many 2-grams, including some at low ranks (‘und der’, ‘and the’, ‘e di’), have negative z-scores; a consequence of the high frequency and versatility of some individual words. After normalizing the results to account for varying total word frequencies between different language datasets, we see that all languages exhibit a similar tendency for the z-score to be smaller at higher ranks (measured by the median; this is not the case for the mean). This downward slope can be explained by the large number of 2-grams that are a combination of one highly versatile word, i.e. one that may be combined with a diverse range of other words, with relatively low frequency words (for example ’the antelope’). In such cases, z-scores decrease with rank as z ∼ k −1/2 (see Section 4.2). 2.3

Next-word entropy

Motivated by the observation that some words appear alongside a diverse range of other words, whereas others appear more consistently with the same small set of words, we examine the distribution of next-word entropies. Specifically, we define the next-word entropy for a given word i as the (non-normalized) Shannon entropy of the set of words that appear as the second word in 2-grams for which i is the first. In short, the next-word entropy of a given word quantifies the difficulty of predicting the following word. As shown in Figure 6, words with higher next-word entropy are less abundant than those with lower next-word entropy, and the relationship is approximately exponential. 2.4

Change probability of N -gram usage

To complement the analysis of rank diversity, we propose a related measure: the change probability p(k), i.e. the probability that a word at rank k will change rank in one time interval. We calculate it for a given language dataset by dividing the number of times elements change for given rank k by the number of temporal transitions, T − 1. The change probability behaves similarly to rank diversity in some cases. For example, if there are only two N -grams that appear with rank 1, d(1) = 2/155. If one word was ranked This is a provisional file, not the final typeset article

4

Morales et al.

Rank Dynamics of Word Usage

first until 1900 and then a different word became first, there was only one rank change, thus p(1) = 1/154. However, if the words alternated ranks every year (which does not occur in the datasets studied), the rank diversity would be the same, but p(1) = 1. Figure 7 shows the behavior of the change probability p(k) for all languages studied. We see that p(k) grows faster than d(k) for increasing rank k. The curves can also be well fitted with the sigmoid of Equation (1) (fit parameters in Table 2). Figure 8 shows the relationship between µ and σ of the sigmoid fits for the change probability p(k). As with the rank diversity, µ decreases with N for each language, except for French and German between 3-grams and 4-grams. However, the σ values seem to have a low correlation with N . We also analyze the difference between rank diversity and change probability, d(k) − p(k) (Figure S1). As the change probability grows faster with rank k, the difference becomes negative and then grows together with the rank diversity. For large k, both rank diversity and change probability tend to one, so their difference is zero. 2.5

Rank Entropy of N -gram usage

We can define another related measure: the rank entropy E(k). Based on Shannon’s information, it is simply the normalized information for the elements appearing at rank k during all time intervals (see Methods). For example, if at rank k = 1 only two N -grams appear, d(1) = 2/155. Information is maximal when the probabilities of elements are homogeneous, i.e. when each N -gram appears half of the time, as it is uncertain which of the elements will occur in the future. However, if one element appears only once, information will be minimal, as there will be a high probability that the other element will appear in the future. As with the rank diversity and change probability, the rank entropy E(k) also increases its value with rank k, even faster in fact, as shown in Figure 9. Similarly, E(k) tends to be higher as N grows, and may be fitted by the sigmoid of Equation (1) at least for high enough k (see fit parameters in Table 3) Notice that since rank entropy in some cases has already high values at k = 1, the sigmoids can have negative µ values. The µ and σ values are compared in Figure 10. The behavior of these parameters is more diverse than for rank diversity and change probability. Still, the curves tend to have a “horseshoe” shape, where µ decreases and σ increases up to N ≈ 3, and then µ slightly increases while σ decreases. 2.6

Rank Complexity of N -gram usage

Finally, we define the rank complexity C(k) as C(k) = 4E(k)(1 − E(k)).

(2)

This measure of complexity represents a balance between stability (low entropy) and change (high entropy) (Gershenson and Fern´andez, 2012; Fern´andez et al., 2014; Santamar´ıa-Bonfil et al., 2016). So complexity is minimal for extreme values of the normalized entropy [E(k) = 0 or E(k) = 1] and maximal for intermediate values [E(k) = 0.5]. Figure 11 shows the behaviour of the rank complexity C(k) for all languages studied. In general, since E(k) ≈ 0.5 for low ranks, the highest C(k) values appear for low ranks and decrease as E(k) increases. C(k) also decreases with N . Moreover, C(k) curves reach values close to zero when E(k) is close to one: around k = 102 for N = 5 and k = 103 for N = 1, for all languages.

Frontiers

5

Morales et al.

3

Rank Dynamics of Word Usage

DISCUSSION

Our statistical analysis suggests that human language is an example of a cultural construct where macroscopic statistics (usage frequencies of N -grams for N > 1) cannot be deduced from microscopic statistics (1-grams). Since not all word combinations are valid in the grammatical sense, in order to study higher-order N -grams, the statistics of 1-grams are not enough, as shown by the null model results. In other words, N -gram statistics cannot be reduced to word statistics. This implies that multiple scales should be studied at the same time to understand language structure and use in a more integral fashion. We conclude not only that semantics and grammar cannot be reduced to syntax, but that even within syntax, higher scales (N -grams with N > 1) have an emergent, relevant structure which cannot be exclusively deduced from the lowest scale (N = 1). While the alphabet, the grammar, and the subject matter of a text can vary greatly among languages, unifying statistical patterns do exist, and they allow us to study language as a social and cultural phenomenon without limiting our conclusions to one specific language. We have shown that despite many clear differences between the six languages we have studied, each language balances a versatile but stable core of words with less frequent but adaptable (and more content-specific) words in a very similar way. This leads to linguistic structures that deviate far from what would be expected in a random ‘language’ of shuffled 1-grams. In particular, it causes the most commonly used word combinations to deviate further from random that those at the other end of the usage scale. If we are to assume that all languages have converged on the same pattern because it is in some way ‘optimal’, then it is perhaps this statistical property that allows word combinations to carry more information that the sum of their parts; to allow words to combine in the most efficient way possible in order to convey a concept that cannot be conveyed through a sequence of disconnected words. The question of whether or not the results we report here are consistent with theories of language evolution (Nowak and Krakauer, 1999; Cancho and Sol´e, 2003; Baronchelli et al., 2006) is certainly a topic for discussion and future research. Apart from studying rank diversity, in this work we have introduced measures of change probability, rank entropy, and rank complexity. Analytically, the change probability is simpler to treat than rank diversity, as the latter varies with the number of time intervals considered (T ), while the former is more stable (for a large enough number of observations). Still, rank diversity produces smoother curves and gives more information about rank dynamics, since the change probability grows faster with k. Rank entropy grows even faster, but all three measures [d(k), p(k), and E(k)] seem related, as they tend to grow with k and N in a similar fashion. Moreover, all three measures can be relatively well fitted by sigmoid curves (the worst fit has e = 0.02, as seen in Table 1-3). Our results suggests that a sigmoid functional shape fits rank diversity the best for low ranks, as the change probability and rank entropy have greater variability in that region. In Cocho et al. (2015), we used the parameters of the sigmoid fit to rank diversity as an approximation of language core size, i.e. the number of 1-grams minimally required to speak a language. Assuming that these basic words are frequently used (low k) and thus have d(k) < 1, we consider the core size to be bounded by log10 k = µ + 2σ. As Table 4 shows, this value decreases with N , i.e. N -gram structures with larger N tend to have smaller cores. However, if the number of different words found on cores are counted, they increase from monograms to digrams, except for Spanish and Italian. From N = 2, the number of words in cores decreases constantly for all languages. This suggests that core words can be combined to form more complex expressions without the requirement of learning new words. English and French tend to have more words in their cores, while Russian has the least. It is interesting to note that the null model produces cores This is a provisional file, not the final typeset article

6

Morales et al.

Rank Dynamics of Word Usage

with about twice as many words as real 2-grams. Also, only in language cores rank complexity values are not close to zero. In other words, only ranks within the core have a high rank complexity. Our results may have implications for next-word prediction algorithms used in modern typing interfaces like smartphones. Lower ranked N -grams tend to be more predictable (higher z-scores and lower next word entropy on average). Thus, next-word prediction should adjust the N value (scale) depending on the expected rank of the recent, already-typed words. If these are not in top ranked N -grams, then N should be decreased. For example, on the iOS 11 platform, after typing ‘United States of’, the system suggests ‘the’, ‘all’, and ‘a’, as the next-word prediction by analyzing 2-grams. However, it is clear that the most probable next-word is ‘America’, as this is a low-ranked 4-gram. Beyond the previous considerations, perhaps the most relevant aspect of our results is that the rank dynamics of language use is generic not only for all six languages, but for all five scales studied. Whether the generic properties of rank diversity and related measures are universal still remains to be explored. Yet, we expect this and other research questions to be answered in the coming years as more data on language use and human culture becomes available.

4 4.1

METHODS Data description

Data was obtained from the Google Books N -gram dataset 1 , filtered and processed to obtain ranked N -grams for each year for each language. Data considers only the first 11, 140 ranks, as this was the maximum rank available for all time intervals and languages studied. From these, rank diversity, change probability, rank entropy, and rank complexity were calculated as follows. Rank diversity is given by d(k) =

|X(k)| , T

(3)

where |X(k)| is the cardinality (i.e. number of elements) that appear at rank k during all T = 155 time intervals (between 1855 and 2009 with one-year differences, or ∆t = 1). The change probability is Pt=T −1 p(k) =

t=0

1 − δ(X(k, t), X(k, t + 1)) , T −1

(4)

where δ(X(k, t), X(k, t + 1)) is the Kronecker delta; equal to zero if there is a change of N -gram in rank k in ∆t [i.e. the element X(k, t) is different from element X(k, t + 1)], and equal to one if there is no change. The rank entropy is given by |X(k)|

E(k) = −κ

X

pi log pi ,

(5)

i=1

where κ=

1

1 , log2 |X(k)|

(6)

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Frontiers

7

Morales et al.

Rank Dynamics of Word Usage

so as to normalize E(k) in the interval [0, 1]. Note that |X(k)| is the alphabet length, i.e. the number of elements that have occurred at rank k. Finally, the rank complexity is calculated using Eq. 2 and Eq. 5 (Fern´andez et al., 2014). 4.2

Modelling shuffled data

We first describe a shuffling process that eliminates any structure found within the 2-gram data, while preserving the frequency of individual words. Consider a sequence consisting of the most frequent word a number of times equal to its frequency, followed by the second most frequent word a number of times equal to its frequency, and so on all the way up to the 11, 140th most frequent word (i.e until all the words in the monogram data have been exhausted). Now suppose we shuffle this sequence and obtain the frequencies of 2-grams in the new sequence. Thus, we have neglected any grammatical rules about which words are allowed to follow which others (we can have the same word twice in the same 2-gram, for example), but the frequency of words remains the same. We now derive an expression for the probability that a 2-gram will have a given frequency after shuffling has been performed. Let fi denote the number of times the word i appears in the text, and fij the number P of times the 2-gram ij appears. Additionally, F = i fi . We want to know the probability P (fij ) that ij appears exactly fij times in the table. We can think of P (fij ) as the probability that exactly fij occurrences of i are followed by j. Supposing fi < fj , fij is determined by fi independent Bernoulli trials with the probability of success equal to the probability that the next word will be j, i.e. fj /F . In this case we have  P (fij ) =

fi fij



fj F

fij 

fj 1− F

fi −fij .

(7)

This distribution meets the condition that allows it to be approximated by a Poisson distribution, namely that fi fj /F is constant, so we have f λijij e−λij P (fij ) ≈ , (8) fij ! where

fi fj F is the mean, and also the variance, of the distribution of values of fij . λij =

(9)

For each 2-gram we calculate the z-score. This is a normalized frequency of its occurrence, i.e. we normalize the actual frequency fij by subtracting the mean of the null distribution and dividing by the standard deviation, fij − µij fij − λij zij = = p . (10) σij λij In other words, the z-score tells us how many standard deviations the actual frequency is from the mean of the distribution derived from the shuffling process. The result is that the 2-grams with the highest z-scores are those which occur relatively frequently but their component words occur relatively infrequently. Normalization. To compare z-scores of different languages, we normalize to eliminate the effects of incomplete data. Specifically, we normalize z-scores by dividing by the upper bound (which happens to be equal in order of magnitude to the lower bound). The highest possible z-score occurs in cases where This is a provisional file, not the final typeset article

8

Morales et al.

Rank Dynamics of Word Usage

fi = fj = fij = f . Therefore λij = f 2 /F and zij ≤



  √ f F 1− < F, F

(11)

√ so an upper bound exists at F . Similarly, The lowest possible z-score would hypothetically occur when fi = fj ≈ F/2 and fij = f , giving √

F zij ≤ F We thus define the normalized z-score as

√   F F 2f − >− . 2 2

(12)

zij zˆij = √ . F

(13)

The relationship between rank and z-score. To understand how the z-score changes as a function of rank, we look at another special case: suppose that i is a word that is found to be the first word in a relatively large number of 2-grams, and that all occurrences of the word j are preceded by i. In such cases we have fi,j = fj , so Eq.(13) reduces to   1 1 − zˆij = (fi fj )1/2 . (14) fi F Now consider only the subset of 2-grams that start with i and end with words that are only ever found to be 1/2 preceded by i. Since fi is constant within this subset, we have zˆij = Afj , where A is a constant. If we now assume that Zipf’s law holds for the set of second words in the subset, i.e. that fj = Brj−1 where rj is −1/2

the rank of j and B another constant, then we have zˆij = Crj

, with C a third constant.

Data. Unlike in other parts of this study, the shuffling analysis is applied to the 105 lowest ranked 2-grams. 4.3

Next-word entropy

The relationship between rank and z-score of 2-grams appears to be, at least partially, a consequence of the existence of high frequency core words that can be followed by many possible next words. This diversity of next words can be quantified by what we call the next-word entropy. Given a word i, we define the next-word entropy, Einw , of i to be the (non-normalized) Shannon entropy of the distribution of 2-gram frequencies of 2-grams that have i as the first word,   X fij fij nw log . (15) Ei = − fi fi i

4.4

Fitting process

The curve fitting for rank diversity, change probability, and rank entropy has been made with the scipy-numpy package using the non-linear least squares method (Levenberg-Marquardt algorithm). For rank entropy, we average data over each ten ranks, k i = Pn/10

Pn/10 i=0

10

ki

, as well as over rank entropy values,

i=0 E(ki ) . 10

E(ki ) = With this averaged data, we adjust a cumulative normal (erf function) over the data of log10 (k i ) and E(ki ). For rank diversity and change probability, we average data over points equally Frontiers

9

Morales et al.

Rank Dynamics of Word Usage

spaced in log10 (ki ). Like for rank entropy, a sigmoid (Eq. 1) is fitted for log10 (k) and d(k), as well as for log10 (k) and p(k). To calculate the mean quadratic error, we use s Pn 2 ˆ i=1 (Xi − Xi ) e= , (16) n where Xˆi is the value of the sigmoid adjusted to rank ki and Xi is the real value of d(ki ). For p(k) and E(k) the error is calculated in the same way.

CONFLICT OF INTEREST STATEMENT The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS All authors contributed to the conception of the paper. JAM, EC, and SS processed and analysed the data. EC and GI devised the null model. CP and EC made the figures. EC, GI, JF, and CG wrote sections of the paper. All authors contributed to manuscript revision, read and approved the final version of the article.

FUNDING We acknowledge support from UNAM-PAPIIT Grant No. IG100518, CONACyT Grant No. 285754, and the Fundaci´on Marcos Moshinsky.

SUPPLEMENTAL DATA Additional Tables and Figures.

REFERENCES Acerbi, A., Lampos, V., Garnett, P., and Bentley, R. A. (2013). The expression of emotions in 20th century books. PLoS ONE 8, e59030. doi:10.1371/journal.pone.0059030 Baek, S. K., Bernhardsson, S., and Minnhagen, P. (2011). Zipf’s law unzipped. New Journal of Physics 13, 043004 Baronchelli, A., Felici, M., Loreto, V., Caglioti, E., and Steels, L. (2006). Sharp transition towards shared vocabularies in multi-agent systems. Journal of Statistical Mechanics: Theory and Experiment 2006, P06014 Blumm, N., Ghoshal, G., Forr´o, Z., Schich, M., Bianconi, G., Bouchaud, J.-P., et al. (2012). Dynamics of ranking processes in complex systems. Physical Review Letters 109, 128701 Cancho, R. F. i. and Sol´e, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences 100, 788–791. doi:10.1073/pnas.0335980100 Cocho, G., Flores, J., Gershenson, C., Pineda, C., and S´anchez, S. (2015). Rank diversity of languages: Generic behavior in computational linguistics. PLoS ONE 10, e0121898. doi:10.1371/journal.pone. 0121898 Corominas-Murtra, B., Fortuny, J., and Sol´e, R. V. (2011). Emergence of Zipf’s law in the evolution of communication. Phys. Rev. E 83, 036115. doi:10.1103/PhysRevE.83.036115 This is a provisional file, not the final typeset article

10

Morales et al.

Rank Dynamics of Word Usage

Dodds, P. S., Clark, E. M., Desu, S., Frank, M. R., Reagan, A. J., Williams, J. R., et al. (2015). Human language reveals a universal positivity bias. Proceedings of the National Academy of Sciences 112, 2389–2394. doi:10.1073/pnas.1411678112 Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., and Danforth, C. M. (2011). Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PloS one 6, e26752 Febres, G., Jaffe, K., and Gershenson, C. (2015). Complexity measurement of natural and artificial languages. Complexity 20, 25–48. doi:10.1002/cplx.21529 Fern´andez, N., Maldonado, C., and Gershenson, C. (2014). Information measures of complexity, emergence, self-organization, homeostasis, and autopoiesis. In Guided Self-Organization: Inception, ed. M. Prokopenko (Berlin Heidelberg: Springer), vol. 9 of Emergence, Complexity and Computation. 19–51. doi:10.1007/978-3-642-53734-9 2 Gerlach, M. and Altmann, E. G. (2013). Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006. doi:10.1103/PhysRevX.3.021006 Gerlach, M., Font-Clos, F., and Altmann, E. G. (2016). Similarity of symbol frequency distributions with heavy tails. Phys. Rev. X 6, 021009. doi:10.1103/PhysRevX.6.021009 Gershenson, C. and Fern´andez, N. (2012). Complexity and information: Measuring emergence, selforganization, and homeostasis at multiple scales. Complexity 18, 29–44. doi:10.1002/cplx.21424 Ghanbarnejad, F., Gerlach, M., Miotto, J. M., and Altmann, E. G. (2014). Extracting information from s-curves of language change. Journal of The Royal Society Interface 11. doi:10.1098/rsif.2014.1044 Lieberman, E., Michel, J.-B., Jackson, J., Tang, T., and Nowak, M. A. (2007). Quantifying the evolutionary dynamics of language. Nature 449, 713 EP – Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Team, T. G. B., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science 331, 176–182. doi:10.1126/science. 1199644 Morales, J. A., S´anchez, S., Flores, J., Pineda, C., Gershenson, C., Cocho, G., et al. (2016). Generic temporal features of performance rankings in sports and games. EPJ Data Science 5, 33. doi:10.1140/ epjds/s13688-016-0096-y Newman, M. E. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46, 323–351 Nowak, M. A. and Krakauer, D. C. (1999). The evolution of language. Proceedings of the National Academy of Sciences 96, 8028–8033. doi:10.1073/pnas.96.14.8028 Perc, M. (2012). Evolution of the most common English words and phrases over the centuries. Journal of The Royal Society Interface 9, 3323–3328. doi:10.1098/rsif.2012.0491 Perc, M. (2013). Self-organization of progress across the century of physics. Scientific Reports 3, 1720 Petersen, A. M., Tenenbaum, J., Havlin, S., and Stanley, H. E. (2012a). Statistical laws governing fluctuations in word use from word birth to word death. Scientific Reports 2, 313 Petersen, A. M., Tenenbaum, J. N., Havlin, S., Stanley, H. E., and Perc, M. (2012b). Languages cool as they expand: Allometric scaling and the decreasing need for new words. Scientific Reports 2, 943. doi:10.1038/srep00943 Pi˜na-Garcia, C. A., Gershenson, C., and Siqueiros-Garc´ıa, J. M. (2016). Towards a standard sampling methodology on online social networks: Collecting global trends on Twitter. Applied Network Science 1, 3. doi:10.1007/s41109-016-0004-1 Pi˜na-Garc´ıa, C. A., Siqueiros-Garc´ıa, J. M., Robles-Belmont, E., Carre´on, G., Gershenson, C., and L´opez, J. A. D. (2018). From neuroscience to computer science: a topical approach on Twitter. Journal of Computational Social Science 1, 187–208. doi:10.1007/s42001-017-0002-9 Frontiers

11

Morales et al.

Rank Dynamics of Word Usage

Santamar´ıa-Bonfil, G., Fern´andez, N., and Gershenson, C. (2016). Measuring the complexity of continuous distributions. Entropy 18, 72. doi:10.3390/e18030072 ´ Bogu˜na´ , M., Haro, M., and Arcos, J. L. (2012). Measuring the evolution of Serr`a, J., Corral, A., contemporary western popular music. Scientific Reports 2, 521. doi:10.1038/srep00521 Sol´e, R. V., Valverde, S., Casals, M. R., Kauffman, S. A., Farmer, D., and Eldredge, N. (2013). The evolutionary ecology of technological innovations. Complexity 18, 15–27. doi:10.1002/cplx.21436 ˇ Tadi´c, B., Gligorijevi´c, V., Mitrovi´c, M., and Suvakov, M. (2013). Co-evolutionary mechanisms of emotional bursts in online social dynamics and networks. Entropy 15, 5084–5120. doi:10.3390/ e15125084 Wagner, C., Singer, P., and Strohmaier, M. (2014). The nature and evolution of online food preferences. EPJ Data Science 3, 38. doi:10.1140/epjds/s13688-014-0036-7 Wijaya, D. T. and Yeniterzi, R. (2011). Understanding semantic change of words over centuries. In Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web (ACM), 35–40 Wilkens, M. (2015). Digital humanities and its application in the study of literature and culture. Comparative Literature 67, 11–20. doi:10.1215/00104124-2861911 Zipf, G. K. (1932). Selective Studies and the Principle of Relative Frequency in Language (Cambridge, MA, USA: Harvard University Press)

This is a provisional file, not the final typeset article

12

Morales et al.

Rank Dynamics of Word Usage

FIGURE CAPTIONS  1-grams  2-grams 1000

 3-grams  4-grams  5-grams

k

100

10

1 1900

1950

2000

year Figure 1. Rank evolution of N -grams in French. Rank trajectories across time for N -grams (N = 1, . . . , 5) that are initially in rank 1, 10, 100 and 1000 at the year 1855. The plot is semilogarithmic, so similar changes across ranks correspond to changes proportional to the rank itself. Other languages (not shown) behave in a similar way: changes are more frequent as N increases. We have added a small shift over the y-axis for some curves, to see more clearly how the most frequently used N -grams remain at k = 1 for long periods of time.

Frontiers

13

Morales et al.

Rank Dynamics of Word Usage

.

d(k)

1.0

English

1.0

French

1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1 1.0

2

3

4

Italian

1 1.0

2

3

4

Russian

1 1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1

2

3

4

1

2

3

German

4

    

1-gram 2-gram 3-gram 4-gram 5-gram

2

3

4

2

3

4

Spanish

1

log10 k Figure 2. Rank diversity for different languages and N -grams. Binned rank diversity d(k) as a function of rank k for all languages and N values considered (continuous lines). We also include fits according to the sigmoid in Equation (1) (dashed lines), with µ, σ and the associated e error summarised in Table 1. Windowing is done averaging d(k) every 0.05 in log10 k.

. 0.95

55 5

 English

4

0.90

4

0.85

5

σ

0.80

33 4 5 43 33 3 42

    

French German Italian Russian Spanish

0.75

5

0.70

2 2 212 2 11 1 11

4

0.65 0.60 1.0

1.2

1.4

1.6

1.8

2.0

2.2

µ Figure 3. Fitted parameters for rank diversity. Parameters µ and σ for the sigmoid fit of the rank diversity d(k), for all languages (indicated by colors) and N values (indicated by numbers). We observe an (approximate) inversely proportional relation between µ and σ. This is a provisional file, not the final typeset article

14

Morales et al.

Rank Dynamics of Word Usage

.

d(k)

1.0

English

1.0

French

1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1 1.0

2

3

4

Italian

1 1.0

2

3

4

Russian

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

2

3

4

1

2

3

4

 2-gram  null 1

1.0

0.8

1

German

2

3

4

2

3

4

Spanish

1

log10 k

Figure 4. Rank diversity in the null model. Rank diversity d(k) of both empirical and randomly generated 2-grams. The rank diversity of the null model tends to be to the right of that of the data, i.e. lower. Windowing in this and similar figures is done in the same way as in Figure 2.

Figure 5. z-scores between empirical and null model digrams. z-scores, calculated from Eq.(13), for the top 105 2-grams in 2008. Each point represents a 2-gram found in the empirical data. The blue dots show the median z-score of logarithmically binned values (first bin contains only the first rank; each consecutive bin is twice as large as the previous one). The inset shows the same values on a linear scale y axis, with a dashed line indicating z = 0. Frontiers

15

Morales et al.

Rank Dynamics of Word Usage

Figure 6. Next-word entropy for different languages and null model. Probability distribution of nextword entropies, calculated using Eq. (15). The range of entropies is segregated into bins of width 1/2, while the probability is calculated as the number of words whose next-word entropy falls inside the bin, divided by the total number of words.

.

p(k)

1.0

English

1.0

French

1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1 1.0

2

3

4

Italian

1 1.0

2

3

4

Russian

1 1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1

2

3

4

1

2

3

4

German     

1-gram 2-gram 3-gram 4-gram 5-gram

2

3

4

2

3

4

Spanish

1

log10 k Figure 7. Change probability for different languages and N -grams. Binned change probability p(k) as a function of rank k for all languages and N values considered (continuous lines). We also include fits according to the sigmoid in Equation (1) (dashed lines), as in Figure 2. This is a provisional file, not the final typeset article

16

Morales et al.

Rank Dynamics of Word Usage

.

4

5 0.8

5

3

 English     

5

French German Italian Russian Spanish

33 3 5 2 4 43 2 12 4 2 2 1 3 1 5 4 5 1 1

0.7

4 σ

0.6

0.5

0.4

1 0.0

0.5

1.0

1.5

µ Figure 8. Fitted parameters for change probability. Parameters µ and σ for the sigmoid fit of the change probability p(k), for all languages (indicated by colors) and N values (indicated by numbers).

.

E(k)

1.0

English

1.0

French

1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1 1.0

2

3

4

Italian

1 1.0

2

3

4

Russian

1 1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1

2

3

4

1

2

3

4

German     

1-gram 2-gram 3-gram 4-gram 5-gram

2

3

4

2

3

4

Spanish

1

log10 k Figure 9. Rank entropy for different languages and N -grams. Binned rank entropy E(k) as a function of rank k for all languages and N values considered (continuous lines). We also include fits according to the sigmoid in Equation (1) (dashed lines), as in Figure 2. Frontiers

17

Morales et al.

Rank Dynamics of Word Usage

.

1.4

33 32 2 44 44 3

1.3

σ

1.2

5

 English

5 5

3

5

3 4 4

    

2

French German Italian Russian Spanish

1.1

5 1.0

2 21 1

0.9

1 5

0.8 -0.5

0.0

11 1 2

0.5

µ

Figure 10. Fitted parameters for rank entropy. Parameters µ and σ for the sigmoid fit of rank entropy E(k), for all languages (indicated by colors) and N values (indicated by numbers).

.

C(k)

1.0

English

1.0

French

1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1 1.0

2

3

4

Italian

1 1.0

2

3

4

Russian

1 1.0

0.8

0.8

0.8

0.6

0.6

0.6

0.4

0.4

0.4

0.2

0.2

0.2

1

2

3

4

1

2

3

4

German     

1-gram 2-gram 3-gram 4-gram 5-gram

2

3

4

2

3

4

Spanish

1

log10 k Figure 11. Rank complexity for different languages and N -grams. Binned rank complexity C(k) as a function of rank k for all languages and N values considered. Rank complexity tends to be greater for lower N , as rank entropy increases with N . This is a provisional file, not the final typeset article

18

Morales et al.

Rank Dynamics of Word Usage

TABLES English French German Italian Russian Spanish

µ 2.281 2.273 2.234 2.208 2.074 2.137

1grams σ 0.62 0.63 0.611 0.638 0.613 0.69

e 0.019 0.02 0.018 0.017 0.014 0.017

µ 2.135 2.183 2.123 2.031 1.811 2.076

2grams σ 0.721 0.689 0.695 0.717 0.767 0.675

e 0.016 0.016 0.015 0.014 0.012 0.017

µ 1.822 1.815 1.684 1.663 1.545 1.701

3grams σ 0.823 0.818 0.839 0.817 0.786 0.843

e 0.013 0.013 0.012 0.012 0.01 0.012

µ 1.755 1.646 1.461 1.261 1.427 1.379

4grams σ 0.775 0.82 0.818 0.935 0.713 0.904

e 0.012 0.011 0.009 0.009 0.008 0.01

µ 1.553 1.385 0.958 0.953 1.264 1.025

5grams σ 0.822 0.849 0.936 0.95 0.705 0.96

e 0.01 0.01 0.007 0.007 0.006 0.008

Random 2grams µ σ e 2.605 0.598 0.024 2.684 0.598 0.022 2.509 0.636 0.02 2.53 0.627 0.019 2.228 0.628 0.017 2.573 0.551 0.024

Table 1. Fit parameters for rank diversity for different languages, N -grams and null model. Mean µ, standard deviation σ, and error e for the sigmoid fit of the rank diversity d(k) according to Equation (1). We also show the fit parameters for the null model of Figure 4.

English French German Italian Russian Spanish

µ 1.494 1.618 1.488 1.467 1.159 1.493

1grams σ 0.549 0.427 0.515 0.429 0.595 0.355

e 0.008 0.009 0.009 0.009 0.007 0.009

µ 1.295 1.307 1.219 1.054 0.802 1.295

2grams σ 0.54 0.573 0.58 0.616 0.684 0.568

e 0.009 0.008 0.007 0.007 0.005 0.009

µ 0.862 0.761 0.53 0.574 0.778 0.621

3grams σ 0.635 0.711 0.814 0.708 0.527 0.712

e 0.006 0.006 0.004 0.004 0.004 0.004

µ 0.816 0.814 0.567 0.361 0.725 0.299

4grams σ 0.642 0.589 0.635 0.655 0.495 0.826

e 0.005 0.004 0.003 0.004 0.003 0.003

µ 0.598 0.733 0.125 -0.091 0.556 0.116

5grams σ 0.677 0.453 0.726 0.824 0.486 0.793

e 0.005 0.004 0.003 0.002 0.003 0.003

Table 2. Fit parameters for change probability for different languages. Mean µ, standard deviation σ, and error e for the sigmoid fit of the change probability p(k) according to Equation (1).

English French German Italian Russian Spanish

µ 0.751 0.684 0.756 0.731 0.557 0.597

1grams σ 0.898 0.979 0.885 0.895 0.881 0.986

e 0.009 0.012 0.009 0.01 0.009 0.011

µ 0.736 0.455 0.111 -0.264 -0.456 0.527

2grams σ 0.836 1.048 1.214 1.35 1.338 0.973

e 0.009 0.01 0.007 0.004 0.003 0.008

µ -0.446 -0.105 -0.478 -0.375 -0.125 -0.542

3grams σ 1.369 1.195 1.34 1.263 1.062 1.373

e 0.003 0.006 0.003 0.002 0.003 0.002

µ -0.389 -0.431 -0.569 -0.163 -0.102 -0.524

4grams σ 1.301 1.294 1.27 1.027 0.992 1.272

e 0.002 0.002 0.001 0.003 0.002 0.002

µ -0.339 -0.333 -0.435 -0.727 0.073 -0.349

5grams σ 1.198 1.146 1.044 1.191 0.816 1.065

e 0.004 0.002 0.001 0.001 0.002 0.001

Table 3. Fit parameters for rank entropy for different languages. Mean µ, standard deviation σ, and error e for the sigmoid fit of the rank entropy E(k) according to Equation (1).

English French German Italian Russian Spanish

1grams µ + 2σ # words 3.521 3004 3.534 3173 3.456 2622 3.485 2856 3.3 1827 3.516 2988

2grams µ + 2σ # words 3.578 3553 3.561 3469 3.512 3057 3.465 2815 3.345 1997 3.426 2529

3grams µ + 2σ # words 3.468 2813 3.45 2705 3.362 2198 3.297 1919 3.118 1216 3.386 2339

4grams µ + 2σ # words 3.304 1949 3.287 1830 3.097 1203 3.131 1303 2.853 681 3.187 1479

5grams µ + 2σ # words 3.196 1512 3.083 1134 2.83 662 2.854 696 2.674 452 2.945 842

Random 2grams µ + 2σ # words 3.801 6322 3.881 7601 3.78 6032 3.784 6078 3.483 3042 3.675 4728

Table 4. Language core parameters. Upper bound rank log10 k = µ + 2σ for the estimated core size of all languages studied, according to the sigmoid fit of Equation (1), as well as the number of words included in the N -grams within the core in the year 2009.

Frontiers

19

Supplementary Material: Rank Dynamics of Word Usage at Multiple Scales 1

SUPPLEMENTARY TABLES AND FIGURES

1.1

Figures

. English

d(k) − p(k)

-0.1

1

2

3

4

-0.1

-0.2

-0.2

-0.3

-0.3

-0.4

-0.4

-0.5

-0.5

1

2

3

-0.6

-0.6 1

4

-0.1 -0.2 -0.3 -0.4 -0.5 -0.6

2

3

4 -0.1

1

1

2

3

    

4

1-gram 2-gram 3-gram 4-gram 5-gram

Spanish

Russian

Italian -0.1 -0.2 -0.3 -0.4 -0.5 -0.6

German

French

2

3

4 -0.1

1

2

3

4

-0.2 -0.3 -0.4 -0.5 -0.6

-0.2 -0.3 -0.4 -0.5 -0.6

log10 k Figure S1: Difference between rank diversity and change probability across languages and N -grams. As p(k) grows faster than d(k), these curves decrease their value to a minimum close to 0.6, for then increasing when d(k) starts growing and become zero when both measures reach their maximum value of one. Windowing is done averaging d(k) every 0.05 in logk.

1

Supplementary Material

1.2

Tables Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 the of and to in a is his he that for not as i this it ’ by be and their was surveyor with upon but no are have were

1800 the of and to in a that is i his it was with he as for be not by which the on at this had from have their but you

1900 the of and to in a that is was as i his with it for the he by be not which on at had or from are have this ’

2000 the of and to a in that is for the was with as i not on ’s it be by or are you from at his he have an this

Table S1. Top English 1-grams for arbitrary years.

2

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 of–the in–the to–the the–surveyor and–the upon–the for–the that–the by–the of–his to–be the–stage in–his to–make with–the is–not the–surveyor the–drama as–the of–their not–to does–not at–the and–that in–their he–is from–the the–reader that–the out–of

1800 of–the in–the to–the and–the to–be on–the by–the of–his from–the with–the of–a for–the it–is at–the that–the in–a all–the is–the with–a it–was in–his i–have as–the he–was have–been that–he the–same of–their had–been of–this

1900 of–the in–the to–the and–the on–the to–be by–the of–a for–the from–the with–the at–the that–the it–is of–his in–a with–a the–same is–the it–was is–a it–is as–the have–been as–a he–was had–been may–be all–the that–he

2000 of–the in–the to–the on–the and–the for–the to–be of–a from–the with–the at–the that–the in–a by–the as–a is–a is–the it–is do–not with–a did–not can–be as–the to–a the–same for–a is–not into–the it–was was–a

Table S2. Top English 2-grams for arbitrary years.

Frontiers

3

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 so–much–as of–the–stage the–surveyor–’s by–the–surveyor as–the–surveyor up–to–the the–rest–of the–english–stage as–well–as to–this–i part–of–the of–the–fable not–to–be not–so–much of–the–play of–the–drama comedy–and–tragedy but–the–surveyor upon–the–stage tragedy–and–comedy the–surveyor–is the–reader–may the–authority–of rest–of–the one–would–think is–not–so and–as–for a–man–’s to–no–purpose to–make–the

1800 one–of–the as–well–as part–of–the i–do–not out–of–the the–name–of in–order–to i–can–not can–not–be the–same–time and–in–the of–the–world according–to–the that–it–was of–the–most that–he–was is–to–be the–united–states at–the–same it–is–not to–have–been that–of–the that–it–is that–he–had of–all–the to–be–a not–to–be in–the–same in–the–world there–is–no

1900 one–of–the the–united–states part–of–the i–do–not as–well–as out–of–the can–not–be and–in–the in–order–to the–end–of in–which–the of–the–united that–he–was is–to–be i–can–not of–the–most the–same–time that–of–the that–it–was the–name–of some–of–the of–the–world that–he–had the–fact–that in–the–same side–of–the it–is–not that–it–is to–be–a at–the–same

2000 one–of–the as–well–as the–united–states i–do–not part–of–the the–end–of out–of–the in–order–to end–of–the some–of–the the–number–of to–be–a i–did–not be–able–to the–use–of a–number–of can–not–be the–fact–that in–the–united do–not–know the–same–time in–terms–of in–which–the there–is–a a–lot–of there–is–no i–can–not the–rest–of you–do–not side–of–the

Table S3. Top English 3-grams for arbitrary years.

4

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 not–so–much–as to–this–i–answer the–rest–of–the this–may–serve–to the–authority–of–the i–must–tell–him but–it–seems–the as–far–as–it and–this–may–serve a–great–part–of a–great–deal–of whence–it–will–follow upon–the–english–stage upon–the–account–of to–take–notice–of the–moral–of–the the–business–of–the so–much–as–a reader–may–please–to me–in–mind–of it–may–be–so in–the–time–of in–the–reign–of i–shall–endeavour–to he–is–pleased–to far–as–it–appears end–of–the–play but–i–’–m at–the–end–of as–well–as–the

1800 at–the–same–time of–the–united–states at–the–head–of in–the–midst–of the–name–of–the in–the–county–of one–of–the–most i–do–not–know on–the–part–of as–well–as–the in–the–course–of the–rest–of–the in–the–name–of the–end–of–the the–head–of–the in–a–state–of for–the–purpose–of in–the–way–of the–hands–of–the on–the–other–hand for–the–most–part for–the–sake–of the–middle–of–the the–town–of–mansoul is–one–of–the at–the–end–of the–part–of–the into–the–hands–of for–the–first–time the–nature–of–the

1900 of–the–united–states at–the–same–time the–end–of–the one–of–the–most i–do–not–know on–the–part–of in–the–case–of for–the–purpose–of at–the–end–of is–one–of–the in–the–midst–of in–the–united–states on–the–other–hand as–well–as–the for–the–first–time was–one–of–the the–rest–of–the the–head–of–the in–the–course–of the–part–of–the the–middle–of–the a–member–of–the on–the–other–hand at–the–time–of at–the–head–of in–the–form–of a–part–of–the for–the–sake–of the–name–of–the in–the–hands–of

2000 in–the–united–states the–end–of–the i–do–not–know at–the–end–of at–the–same–time of–the–united–states as–well–as–the the–rest–of–the on–the–other–hand one–of–the–most as–a–result–of is–one–of–the in–the–case–of in–the–form–of do–not–want–to on–the–basis–of for–the–first–time did–not–want–to at–the–same–time on–the–other–hand i–do–not–think in–the–middle–of the–top–of–the at–the–time–of can–be–used–to the–middle–of–the in–front–of–the i–do–not–want the–beginning–of–the at–the–university–of

Table S4. Top English 4-grams for arbitrary years.

Rank 1700 1800 1 and–this–may–serve–to on–the–part–of–the 2 as–far–as–it–appears in–the–name–of–the 3 will–not–so–much–as and–at–the–same–time 4 we–have–no–reason–to at–the–head–of–the 5 there–’s–no–need–of in–the–midst–of–the 6 the–root–of–all–evil the–other–side–of–the 7 the–heat–of–the–climate into–the–hands–of–the 8 the–end–of–the–play the–name–of–the–lord 9 the–ancient–and–modern–stages at–the–end–of–the 10 puts–me–in–mind–of in–the–middle–of–the 11 not–so–much–as–offer in–the–hands–of–the 12 not–so–much–as–a on–the–banks–of–the 13 may–not–be–amiss–to at–the–head–of–a 14 it–not–been–for–the in–the–course–of–the 15 it–may–not–be–amiss on–the–other–side–of 16 is–there–no–difference–between of–the–town–of–mansoul 17 is–not–so–much–as the–greater–part–of–the 18 is–an–odd–way–of at–the–foot–of–the 19 i–’–m–afraid–i at–the–bottom–of–the 20 from–whence–it–will–follow the–inferior–extremity–of–the 21 does–not–so–much–as in–the–beginning–of–the 22 does–not–in–the–least at–the–head–of–his 23 but–i–’–m–afraid i–do–not–know–what 24 at–the–latter–end–of the–inner–face–of–the 25 at–the–end–of–the to–be–found–in–the 26 at–that–time–of–day is–one–of–the–most 27 as–it–would–be–to president–of–the–united–states 28 ancient–and–modern–stages–surveyed in–the–form–of–a 29 according–to–the–custom–of the–posterior–face–of–the 30 a–great–part–of–the the–superior–extremity–of–the Table S5. Top English 5-grams for arbitrary years.

Frontiers

1900 on–the–part–of–the at–the–end–of–the and–at–the–same–time at–the–time–of–the the–other–side–of–the at–the–head–of–the in–the–middle–of–the in–the–case–of–the is–one–of–the–most in–the–hands–of–the in–the–midst–of–the in–the–form–of–a on–the–other–side–of at–the–foot–of–the in–the–direction–of–the as–in–the–case–of to–be–found–in–the in–the–history–of–the i–do–not–know–what at–the–beginning–of–the in–the–course–of–the the–greater–part–of–the president–of–the–united–states at–the–close–of–the in–the–name–of–the into–the–hands–of–the at–the–bottom–of–the i–do–not–want–to was–a–member–of–the on–the–side–of–the

2000 at–the–end–of–the in–the–middle–of–the i–do–not–want–to the–united–states–of–america i–do–not–know–what the–other–side–of–the at–the–time–of–the at–the–beginning–of–the as–a–result–of–the on–the–part–of–the is–one–of–the–most in–the–united–states–of at–the–top–of–the at–the–end–of–the i–did–not–want–to in–the–form–of–a at–the–bottom–of–the on–the–other–side–of in–such–a–way–that in–the–united–states–and for–the–first–time–in and–at–the–same–time will–not–be–able–to in–the–center–of–the in–the–case–of–the i–do–not–know–how printed–in–the–united–states by–the–end–of–the you–do–not–have–to you–do–not–have–to

5

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de la que le les à qui en des du par pour dans ne ce est un il a vous une l nous plus au pas se d on y

1800 de la et les le à des que qui dans en du un une par est plus ne il pour se ce a sur on au pas ou l sont

1900 de la et à les le des en du que un une dans qui par est pour il au a ne plus se pas sur ce ou on nous sont

2000 de la et à le des les en du un une que dans qui par est pour au pas sur a plus se ne il ce ou la le sont

Table S6. Top French 1-grams for arbitrary years.

6

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de–la à–la que–les dans–le que–le dans–la que–la de–l ce–qui que–nous tous–les que–vous dans–les ce–que de–ce qui–est de–dieu par–le par–la il–est toutes–les ceux–qui de–reims y–a de–cette qui–ne par–les que–de de–ladite de–ces

1800 de–la à–la et–de dans–les que–les dans–le dans–la et–les que–le que–la tous–les et–la de–ces et–le de–l de–ce de–cette sur–les par–la par–le toutes–les ce–qui sur–la par–les de–son de–ses sur–le et–des que–nous de–leur

1900 de–la à–la et–de dans–les dans–le dans–la que–les que–le et–les que–la et–la et–le par–le sur–les par–la sur–la tous–les par–les sur–le de–ces de–ce et–des a–été de–son de–cette de–l ce–qui pour–les pour–la y–a

2000 de–la à–la et–de dans–le dans–la dans–les et–les et–la que–les et–le que–le sur–le que–la de–l sur–la par–le et–des par–la ce–qui sur–les par–les de–son de–ce à–une pour–les de–cette et–à à–un dans–un de–ces

Table S7. Top French 2-grams for arbitrary years.

Frontiers

7

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de–tous–les mil–six–cens il–y–a tout–ce–qui que–nous–avons ce–qui–est de–ceux–qui tout–ce–que duc–de–reims de–toutes–les six–cens–quatre la–somme–des ce–que–vous ce–que–nous tous–les–jours la–somme–de ville–de–reims de–la–nature pieces–de–comparaison de–ce–que archevêque–duc–de entre–les–mains amour–de–dieu à–ceux–qui du–mois–de aprés–le–choc que–ce–soit que–vous–ne que–nous–ne il–est–évident

1800 et–de–la de–tous–les plus–ou–moins il–y–a de–toutes–les le–nom–de de–la–nature de–la–terre ne–sont–pas tout–ce–qui il–y–a de–la–même la–plus–grande que–nous–avons grand–nombre–de dans–tous–les un–grand–nombre de–la–mer ce–qui–est et–à–la de–ceux–qui une–espèce–de dans–toutes–les la–plupart–des à–tous–les que–dans–les au–lieu–de de–la–vie de–la–france on–ne–peut

1900 et–de–la il–y–a point–de–vue de–la–loi au–point–de de–la–société plus–ou–moins ne–sont–pas en–même–temps il–y–a de–tous–les à–peu–près la–loi–du que–nous–avons à–la–fois le–nom–de de–la–vie et–à–la ce–qui–concerne de–la–ville le–droit–de de–toutes–les de–la–france tout–à–fait ce–qui–est de–plus–en plus–en–plus chemins–de–fer chemin–de–fer la–fin–de

2000 et–de–la à–la–fois il–y–a ne–sont–pas de–la–vie point–de–vue à–partir–de plus–en–plus la–mise–en de–plus–en dans–le–cadre ce–qui–est et–à–la de–la–société de–la–population il–y–a par–rapport–à en–tant–que à–la–fin la–fin–de la–fin–du de–la–ville plus–ou–moins de–la–france en–même–temps de–tous–les de–la–loi la–plupart–des de–la–république ce–qui–concerne

Table S8. Top French 3-grams for arbitrary years.

8

Supplementary Material Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 mil–six–cens–quatre archevêque–duc–de–reims premier–pair–de–france mil–six–cens–soixante il–est–évident–que la–grace–de–dieu procureur–general–du–roy la–commodité–ou–incommodité à–la–requeste–de de–ladite–eglise–de tout–ce–que–vous par–la–grace–de de–tout–ce–qui des–pieces–de–comparaison la–gloire–de–dieu il–est–aisé–de de–la–ville–de commodité–ou–incommodité–de la–piece–arguée–de de–ladite–abbaye–de tout–ce–qui–est il–est–certain–que aprés–la–mort–de par–le–moyen–de il–ne–faut–pas de–nostredite–ville–de contre–les–conclusions–du à–la–somme–des les–conclusions–du–demandeur jour–du–mois–de

1800 un–grand–nombre–de sous–le–nom–de il–y–en–a la–plus–grande–partie de–la–même–manière de–plus–en–plus de–la–loi–du que–je–viens–de la–surface–de–la que–nous–venons–de il–y–a–des en–est–de–même à–la–fin–de de–tout–ce–qui qui–ne–sont–pas toutes–les–parties–de par–le–moyen–de de–la–république–française les–uns–des–autres une–grande–quantité–de à–la–tête–de pour–la–première–fois il–y–a–des plus–grande–partie–de il–ne–faut–pas de–la–part–de de–la–plus–grande le–plus–grand–nombre de–la–part–des dans–le–cours–de

1900 au–point–de–vue de–plus–en–plus point–de–vue–de de–la–loi–du en–ce–qui–concerne sous–le–nom–de un–certain–nombre–de un–grand–nombre–de en–même–temps–que des–chemins–de–fer de–vue–de–la à–la–fin–de ce–qui–concerne–les pour–la–première–fois du–chemin–de–fer à–la–suite–de de–la–ville–de en–ce–qui–concerne de–la–loi–de ce–point–de–vue en–est–de–même ce–qui–concerne–la il–ne–faut–pas au–point–de–vue que–nous–venons–de la–plus–grande–partie y–a–lieu–de il–y–a–des de–la–part–de de–chemins–de–fer

2000 de–plus–en–plus dans–la–mesure–où dans–le–cadre–de en–ce–qui–concerne un–certain–nombre–de la–mise–en–place du–point–de–vue à–la–fin–du pour–la–première–fois à–la–fin–de la–mise–en–œuvre la–fin–de–la au–sein–de–la dans–le–domaine–de par–rapport–à–la en–ce–qui–concerne sous–la–direction–de ce–qui–concerne–les la–fin–des–années la–question–de–la un–point–de–vue point–de–vue–de le–cadre–de–la au–cours–de–la à–partir–de–la de–la–mise–en de–la–part–de à–la–suite–de de–la–part–des qui–ne–sont–pas

Table S9. Top French 4-grams for arbitrary years.

Rank 1700 1800 1 par–la–grace–de–dieu la–plus–grande–partie–de 2 la–commodité–ou–incommodité–de il–en–est–de–même 3 pour–y–avoir–recours–en de–la–manière–la–plus 4 la–grace–de–dieu–roy suit–la–teneur–de–la 5 grace–de–dieu–roy–de la–teneur–de–la–déclaration 6 de–dieu–roy–de–france dont–nous–venons–de–parler 7 cour–de–parlement–de–paris la–surface–de–la–terre 8 de–grace–mil–six–cens anciens–approuve–la–résolution–ci 9 louis–par–la–grace–de dont–je–viens–de–parler 10 par–ces–presentes–signées–de la–présente–résolution–sera–imprimée 11 de–la–commodité–ou–incommodité conseil–des–anciens–approuve–la 12 y–avoir–recours–en–cas des–anciens–approuve–la–résolution 13 du–grand–sceau–de–cire urgence–et–de–la–résolution 14 car–tel–est–nostre–plaisir qui–précède–la–résolution–ci 15 avoir–recours–en–cas–de le–conseil–des–anciens–approuve 16 ordonné–ce–que–de–raison urgence–qui–précède–la–résolution 17 les–lettres–patentes–du–roy et–de–la–résolution–du 18 la–commodité–ou–incommodité–que inséré–au–bulletin–des–lois 19 de–la–cour–de–parlement nom–de–la–republique–française 20 égale–à–la–somme–des au–nom–de–la–republique 21 patentes–du–roy–données–à adoptant–les–motifs–de–la 22 le–procureur–general–du–roy sera–inséré–au–bulletin–des 23 du–procureur–general–du–roy un–plus–grand–nombre–de 24 pour–en–jouir–par–ledit toutes–les–parties–de–la 25 la–chambre–des–comptes–de les–unes–sur–les–autres 26 grand–sceau–de–cire–verte de–la–même–manière–que 27 de–monsieur–le–procureur–general en–est–de–même–de 28 chambre–des–comptes–de–paris sur–le–rapport–du–ministre 29 recours–en–cas–de–besoin les–uns–sur–les–autres 30 quatre–mille–cinq–cens–livres qui–sera–inséré–au–bulletin Table S10. Top French 5-grams for arbitrary years.

Frontiers

1900 au–point–de–vue–de point–de–vue–de–la en–ce–qui–concerne–les à–ce–point–de–vue au–fur–et–à–mesure il–en–est–de–même au–point–de–vue–du en–ce–qui–concerne–la au–point–de–vue–des il–y–a–lieu–de la–plus–grande–partie–de il–y–a–quelques–années en–ce–qui–concerne–le en–ce–qui–concerne–les du–chemin–de–fer–de à–la–suite–de–la des–chemins–de–fer–de il–en–est–de–même à–la–fin–de–la au–point–de–vue–de de–la–façon–la–plus dans–le–sens–de–la la–séance–est–levée–à la–question–de–savoir–si dans–la–plupart–des–cas le–président–de–la–république en–ce–qui–concerne–la en–même–temps–que–la en–même–temps–que–le de–la–manière–la–plus

2000 dans–le–cadre–de–la dans–le–domaine–de–la du–point–de–vue–de au–fur–et–à–mesure en–ce–qui–concerne–les la–mise–en–place–de la–mise–en–œuvre–de en–ce–qui–concerne–la à–la–fin–des–années dans–la–mesure–où–il point–de–vue–de–la à–la–fin–de–la en–ce–qui–concerne–les de–ce–point–de–vue dans–la–mesure–où–elle sur–le–marché–du–travail de–ce–point–de–vue réduction–du–temps–de–travail la–fin–du–xixe–siècle fur–et–à–mesure–que en–ce–qui–concerne–le la–mise–en–œuvre–des de–plus–en–plus–de la–fin–de–la–guerre en–ce–qui–concerne–la dans–la–mesure–où–les à–la–suite–de–la est–de–plus–en–plus il–en–est–de–même de–la–mise–en–œuvre

9

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 und die in zu der daß von ist erfahrung nicht sie den sich mit ein dem als des werden welche so durch bey das wird man sind sey können ich

1800 der und die in zu den von des ist nicht das dem so sich sie als mit auf ein es oder eine daß er auch werden durch man an aber

1900 der die und in den des zu von ist das sich nicht dem mit auf eine als im auch ein die für an so es sie durch bei dass nach

2000 der die und in von den zu des das sich ist mit auf nicht im für als eine die dem auch ein an werden es sie einer daß aus oder

Table S11. Top German 1-grams for arbitrary years.

10

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 die–erfahrung daß–sie durch–die von–der in–der hohen–alters des–hohen zierde–des und–in sich–in in–denselben in–den in–dem hohen–alter der–gottesfurcht daß–ich von–den von–dem theils–an sie–nicht man–nicht ist–es erfahrung–und der–erfahrung dem–hohen daß–er an–andern zu–suchen zu–allen wird–ihm

1800 in–der und–die in–den von–der und–der auf–die durch–die von–dem in–dem mit–dem von–den mit–der daß–die in–die für–die über–die auf–den daß–sie und–in auch–die als–die eben–so aus–dem daß–er in–einem aus–der und–den und–das wenn–sie so–wie

1900 in–der in–den für–die und–die von–der auf–die durch–die und–der mit–der mit–dem über–die in–die dass–die bei–der in–dem von–den an–der auch–die aus–dem auf–den von–dem ist–die mit–den aus–der an–die an–den für–den bei–den sich–die nach–der

2000 in–der für–die in–den und–die auf–die und–der mit–der von–der mit–dem in–die über–die durch–die daß–die bei–der an–der sich–die auch–die für–den aus–der aus–dem auf–den auf–der vor–allem mit–den von–den ist–die nicht–nur an–die in–einem die–sich

Table S12. Top German 2-grams for arbitrary years.

Frontiers

11

Supplementary Material Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 des–hohen–alters dem–hohen–alter von–jugend–auf sich–in–der größte–zierde–des einzig–und–allein die–größte–zierde zu–was–vor wird–ihm–so wir–nach–und wenn–wir–sie von–der–wahrheit und–in–dem theils–an–andern sie–sind–nicht sie–nicht–vor sich–auch–die nicht–nur–bey nach–und–nach mit–recht–von mit–diesem–doppelten mehr–und–mehr kräfte–des–heiligen in–welchen–sie in–der–jugend in–dem–verstande erfahrung–ist–es durch–die–erfahrung die–sich–in die–erfahrung–ist

1800 so–ist–es in–der–folge nach–und–nach und–in–der in–ansehung–der so–wie–die in–der–natur in–der–that eben–so–wenig in–so–fern in–der–welt mehr–oder–weniger asthenie–der–erregung der–andern–seite sich–in–der und–in–den die–in–der in–der–mitte auch–in–der so–kann–man auf–diese–art eine–art–von daß–sie–sich zu–gleicher–zeit der–natur–der und–es–ist in–beziehung–auf daß–er–sich und–von–der der–gewalt–des

1900 in–der–regel in–bezug–auf und–in–der in–der–that mehr–oder–weniger es–sich–um die–zahl–der sich–in–der die–in–der auch–in–der in–erster–linie in–der–mitte in–den–letzten bezug–auf–die in–diesem–falle in–der–nähe und–in–den auch–für–die auf–grund–der sich–in–den handelt–es–sich so–ist–es die–in–den auch–in–den eine–reihe–von in–der–weise auf–diese–weise rücksicht–auf–die mehr–und–mehr der–in–der

2000 in–der–regel die–in–der und–in–der sich–in–der im–rahmen–der im–hinblick–auf auch–in–der es–sich–um in–erster–linie in–der–ddr handelt–es–sich in–der–lage im–bereich–der im–zusammenhang–mit in–den–letzten vor–allem–die nicht–nur–die auch–für–die frankfurt–am–main in–diesem–zusammenhang in–der–bundesrepublik der–bundesrepublik–deutschland sich–in–den sich–auf–die nach–wie–vor die–frage–nach die–in–den hinblick–auf–die in–den–usa und–vor–allem

Table S13. Top German 3-grams for arbitrary years.

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 die–größte–zierde–des wir–nach–und–nach sich–in–der–jugend sich–auch–die–erfahrung kräfte–des–heiligen–geistes die–sich–in–der die–erfahrung–ist–es zum–nutzen–gereichen–können zu–welcher–wir–nach zu–was–vor–einem zu–thun–verbunden–sind zu–dem–genüsse–der zu–allen–zeiten–gegen zu–allen–dingen–nütze zeiget–sich–auch–die wohl–nicht–auf–diese wird–nicht–mehr–von wird–man–nicht–über wird–ihm–nicht–zu wird–die–erfahrung–durch wir–werden–vielmehr–in wir–auch–den–werth wir–am–deutlichsten–aus wie–sie–sich–in wie–oft–haben–nicht wie–es–mit–den wie–dieses–oder–jenes werden–vielmehr–in–den werden–dieselben–von–der wenn–wir–sie–in

1800 von–zeit–zu–zeit auf–der–andern–seite der–natur–der–sache an–und–für–sich verminderung–der–gewalt–des so–viel–als–möglich in–der–mitte–des auf–der–einen–seite die–art–und–weise in–der–natur–der immer–mehr–und–mehr der–könig–von–siam in–absicht–auf–die auf–der–andern–seite in–hinsicht–auf–die in–so–fern–sie auf–irgend–eine–art aus–der–natur–der der–lehre–von–der der–absoluten–gewalt–des das–ding–an–sich so–ist–es–auch so–ist–es–doch der–dinge–an–sich kritik–der–reinen–vernunft die–lehre–von–der auf–den–heutigen–tag in–beziehung–auf–die an–die–stelle–der nur–unter–der–bedingung

1900 in–bezug–auf–die auf–dem–gebiete–der an–und–für–sich in–den–letzten–jahren mit–rücksicht–auf–die in–den–meisten–fällen in–der–nähe–der auf–der–anderen–seite es–sich–um–die in–der–nähe–des bis–zu–einem–gewissen von–zeit–zu–zeit es–sich–um–eine wenn–es–sich–um auf–der–einen–seite handelt–es–sich–um an–die–stelle–der zu–einem–gewissen–grade auf–den–ersten–blick auf–dem–wege–der im–laufe–der–zeit auf–dem–gebiete–des an–ort–und–stelle die–art–und–weise nicht–die–rede–sein in–der–mitte–des die–frage–nach–der in–erster–linie–die in–der–mitte–der in–bezug–auf–den

2000 im–hinblick–auf–die handelt–es–sich–um in–den–letzten–jahren auf–der–anderen–seite in–der–bundesrepublik–deutschland im–zusammenhang–mit–der auf–dem–gebiet–der die–frage–nach–der in–bezug–auf–die es–sich–um–eine auf–der–anderen–seite in–den–neuen–bundesländern auf–den–ersten–blick nicht–in–der–lage es–handelt–sich–um in–bezug–auf–die stellt–sich–die–frage in–der–zweiten–hälfte der–zweiten–hälfte–des auf–der–ebene–der auf–der–einen–seite vor–allem–in–der die–frage–nach–dem nach–dem–zweiten–weltkrieg vor–dem–hintergrund–der im–zusammenhang–mit–dem im–vergleich–zu–den vor–allem–in–den auf–der–grundlage–der in–der–lage–ist

Table S14. Top German 4-grams for arbitrary years.

12

Supplementary Material Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 die–sich–in–der–jugend wird–nicht–mehr–von–den wird–man–nicht–über–die wir–nach–und–nach–zu wir–nach–und–nach–durch wir–auch–den–werth–der wir–am–deutlichsten–aus–den wie–sie–sich–in–allen wie–es–mit–den–menschen wenn–sie–sich–auf–den wenn–er–von–der–wahrheit sie–sind–nicht–nur–zu sich–nach–dem–lauf–der sich–ihr–ganzes–leben–hindurch sich–einzig–und–allein–zu noch–in–der–blüthe–ihrer nicht–auf–diese–weise–die nach–und–nach–zu–der nach–und–nach–durch–die nach–dem–lauf–der–natur mit–recht–von–sich–rühmen man–sich–in–seinem–eigenen in–der–heiligen–schrift–vor in–der–blüthe–ihrer–jahre in–den–gemüthern–der–menschen ich–nicht–zu–viel–sage es–ist–nicht–zu–leugnen es–ist–also–leicht–zu einzig–und–allein–aus–der einer–mehr–als–der–andere

1800 bis–auf–den–heutigen–tag aus–der–natur–der–sache wie–ich–schon–gesagt–habe an–und–für–sich–selbst über–das–verhalten–des–phosphors versuche–über–das–verhalten–des der–natur–der–sache–nach in–der–natur–der–sache des–raums–und–der–zeit des–menschen–gegen–sich–selbst die–eine–oder–die–andere versteht–es–sich–von–selbst und–auf–der–andern–seite aus–dem–wege–zu–räumen a–n–m–e–r religion–innerhalb–der–grenzen–der in–den–stand–zu–setzen geschichte–der–lehre–von–der das–moralische–gesetz–in–seine in–der–kasse–der–bank grundlegung–zur–metaphysik–der–sitten des–angriffs–und–der–vertheidigung nicht–mehr–und–nicht–weniger liegt–in–der–natur–der in–der–ersten–hälfte–des daseyn–gottes–und–die–unsterblichkeit wenn–sie–auch–noch–so so–viel–als–möglich–zu raum–und–in–der–zeit moralische–gesetz–in–seine–maxime

1900 bis–zu–einem–gewissen–grade bis–auf–den–heutigen–tag in–der–zweiten–hälfte–des in–der–mehrzahl–der–fälle in–der–ersten–hälfte–des der–kaiserlichen–akademie–der–wissenschaften in–der–natur–der–sache zum–wirklichen–mitgliede–ernannt–am die–eine–oder–die–andere an–der–universität–in–wien dass–es–sich–hier–um wenn–es–sich–um–die dass–es–sich–um–eine nicht–die–rede–sein–kann der–natur–der–sache–nach von–der–hand–zu–weisen bis–in–die–neueste–zeit handelt–es–sich–um–eine die–art–und–weise–der den–vereinigten–staaten–von–amerika der–einen–oder–der–anderen handelt–es–sich–um–die reich–gottes–ist–in–euch das–reich–gottes–ist–in schon–an–und–für–sich es–liegt–auf–der–hand es–handelt–sich–hier–um ist–hier–nicht–der–ort wo–es–sich–um–die verhält–es–sich–mit–der

2000 in–der–zweiten–hälfte–des handelt–es–sich–um–eine in–der–ersten–hälfte–des in–der–zweiten–hälfte–der kölner–zeitschrift–für–soziologie–und zeitschrift–für–soziologie–und–sozialpsychologie es–handelt–sich–dabei–um handelt–es–sich–um–die auf–der–einen–seite–und o–o–o–o–o handelt–es–sich–um–einen handelt–es–sich–um–ein dabei–handelt–es–sich–um es–handelt–sich–um–eine stellt–sich–die–frage–nach auch–im–hinblick–auf–die sich–in–den–letzten–jahren das–gilt–insbesondere–für–vervielfältigungen mikroverfilmungen–und–die–einspeicherung–und hierbei–handelt–es–sich–um bis–zu–einem–gewissen–grad und–verarbeitung–in–elektronischen–systemen einspeicherung–und–verarbeitung–in–elektronischen und–die–einspeicherung–und–verarbeitung grenzen–des–urheberrechtsgesetzes–ist–ohne die–einspeicherung–und–verarbeitung–in mehr–als–die–hälfte–der im–wahrsten–sinne–des–wortes in–erster–linie–auf–die nicht–mehr–in–der–lage

Table S15. Top German 5-grams for arbitrary years.

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 di del e in la della il che per non a n al con un alla da rei c ed cui è sez civ dei una o nel delle le

1800 di e che la il in a per non si del le i della un da ed con più una de al è delle o nel alla gli l se

1900 di e che la il in del a per non si della un le è i una da dei con al più delle nel ed alla o come gli nella

2000 di e che la in il a del della per un è non si una le con i da dei al nel delle come più alla o la anche nella

Table S16. Top Italian 1-grams for arbitrary years.

Frontiers

13

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 di–un di–cui in–cui ai–sensi con–rinvio cassa–con per–la di–lavoro di–una del–giudice da–parte per–il e–non ex–art con–la non–è che–il tema–di non–può nei–confronti che–la di–merito caso–di in–sede ed–altro ed–altri a–norma secondo–comma giudice–di in–tema

1800 e–di che–si e–la che–non per–la e–che che–il che–la non–si e–per di–un tutte–le e–le la–sua il–quale che–in ed–il tutti–i che–i di–questa di–questo che–le la–quale e–non in–un che–per i–quali e–si di–lui e–con

1900 e–di per–la che–si che–il e–la che–la di–un che–non non–si non–è e–che di–una la–sua e–il in–cui e–non e–le in–un il–suo tutte–le per–le e–per di–cui con–la il–quale per–il tutti–i e–si la–quale si–è

2000 di–un di–una e–di per–la e–la che–si con–la e–il in–cui che–la per–il la–sua che–il che–non non–è in–un con–il il–suo e–le non–si di–cui si–è e–che in–una e–non da–un che–è e–della e–i per–le

Table S17. Top Italian 2-grams for arbitrary years.

14

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 cassa–con–rinvio in–tema–di caso–in–cui in–sede–di da–parte–del competenza–e–giurisdizione datore–di–lavoro giudice–di–merito la–corte–suprema ne–consegue–che nel–caso–in con–la–conseguenza imposte–e–tasse del–giudice–di ricorso–per–cassazione corte–suprema–ha la–conseguenza–che in–cui–il in–caso–di del–datore–di per–pubblica–utilità ai–fini–della cassa–senza–rinvio nel–caso–in rapporto–di–lavoro nel–caso–di espropriazione–per–pubblica in–tema–di nei–confronti–del al–fine–di

1800 che–non–si di–tutte–le di–tutti–i il–nome–di per–lo–più tutti–gli–altri e–per–la in–tutte–le tutto–ciò–che poco–a–poco più–o–meno e–che–non vale–a–dire di–quello–che se–non–che la–maggior–parte in–cui–si non–si–può in–tutti–i a–poco–a la–qual–cosa per–la–sua e–gli–altri il–duca–di di–tutto–il di–cui–si la–prima–volta di–tutti–gli che–non–è per–mezzo–di

1900 più–o–meno non–si–può che–non–si di–tutte–le in–cui–si il–diritto–di di–tutti–i e–per–la il–nome–di disegno–di–legge non–può–essere per–la–sua punto–di–vista la–prima–volta che–non–è in–tutte–le in–tutti–i si–può–dire in–cui–la consiglio–di–stato si–tratta–di in–cui–il una–specie–di e–la–sua in–modo–che tutti–gli–altri in–gran–parte tutto–ciò–che di–cui–si una–serie–di

2000 a–cura–di punto–di–vista in–grado–di una–serie–di in–cui–si una–sorta–di la–possibilità–di il–fatto–che da–parte–di si–tratta–di più–o–meno la–prima–volta in–materia–di che–non–si e–la–sua al–di–là da–un–lato in–questo–caso dal–punto–di in–cui–il di–tutte–le da–parte–del in–cui–la al–fine–di per–la–prima si–tratta–di di–tutti–i per–la–sua non–può–essere la–presenza–di

Table S18. Top Italian 3-grams for arbitrary years.

Frontiers

15

Supplementary Material Rank 1700 1800 1 nel–caso–in–cui a–poco–a–poco 2 con–la–conseguenza–che c–sol–fa–ut 3 la–corte–suprema–ha il–più–delle–volte 4 del–datore–di–lavoro sotto–il–nome–di 5 del–giudice–di–merito commissione–straordinaria–di–governo 6 espropriazione–per–pubblica–utilità il–duca–di–savoja 7 nel–caso–in–cui g–sol–re–ut 8 in–sede–di–legittimità di–mano–in–mano 9 specie–la–corte–suprema per–lo–spazio–di 10 nella–specie–la–corte per–la–qual–cosa 11 considerazione–del–fatto–che per–la–prima–volta 12 caso–in–cui–il in–tutta–la–sua 13 in–considerazione–del–fatto a–la–mi–re 14 corte–suprema–ha–confermato come–si–è–detto 15 da–parte–del–giudice di–giorno–in–giorno 16 ha–confermato–la–sentenza di–tutti–gli–altri 17 provvedimenti–impugnabili–o–no d–la–sol–re 18 suprema–ha–confermato–la per–la–qual–cosa 19 di–cui–agli–artt il–capo–del–femore 20 per–la–prima–volta per–la–maggior–parte 21 infortuni–sul–lavoro–e di–tempo–in–tempo 22 sul–lavoro–e–malattie di–tutte–le–cose 23 lavoro–e–malattie–professionali a–un–di–presso 24 onere–di–prova–del di–tutto–ciò–che 25 opposizione–agli–atti–esecutivi di–cui–si–tratta 26 nella–parte–in–cui del–fondo–di–religione 27 ai–sensi–degli–artt del–re–di–francia 28 sentenza–della–corte–costituzionale in–uno–stato–di 29 il–giudice–di–merito in–tutte–le–sue 30 il–ricorso–per–cassazione la–repubblica–di–venezia Table S19. Top Italian 4-grams for arbitrary years.

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 nella–specie–la–corte–suprema in–considerazione–del–fatto–che specie–la–corte–suprema–ha la–corte–suprema–ha–confermato corte–suprema–ha–confermato–la sul–lavoro–e–malattie–professionali infortuni–sul–lavoro–e–malattie nel–caso–in–cui–il suprema–ha–confermato–la–sentenza sentenza–della–corte–costituzionale–n per–la–prima–volta–in ricorso–per–cassazione–ai–sensi ha–confermato–la–sentenza–impugnata la–sentenza–di–merito–che nella–parte–in–cui–non e–opposizione–agli–atti–esecutivi oppure–confermano–principi–giurisprudenziali–pacifici nel–caso–in–cui–un nel–caso–in–cui–la massimario–della–corte–di–cassazione data–di–entrata–in–vigore considerazione–del–fatto–che–il con–la–conseguenza–che–il accertamento–del–giudice–di–merito sono–quelle–ufficiali–redatte–a sentenza–di–merito–che–aveva rappresentanza–e–difesa–in–giudizio quelle–ufficiali–redatte–a–cura pubblicate–sono–quelle–ufficiali–redatte nel–caso–in–cui–il

1800 in–tutta–la–sua–estensione tutto–il–territorio–della–repubblica primo–console–della–repubblica–francese munita–del–sigillo–della–repubblica di–nuovo–e–da–capo in–tutte–le–sue–parti in–tutto–il–territorio–della diritto–di–esercitare–la–professione entro–il–termine–di–giorni ed–il–duca–di–savoja che–il–più–delle–volte il–diritto–di–esercitare–la tutte–le–parti–del–corpo le–arterie–e–le–vene della–repubblica–francese–una–ed amministrazione–del–fondo–di–religione repubblica–francese–una–ed–indivisibile in–tutto–il–corso–della in–questo–stato–di–cose di–mano–in–mano–che del–primo–console–della–repubblica che–di–mano–in–mano ad–udire–la–parola–di la–quantità–di–calore–che il–termine–di–giorni–quindici gli–uni–che–gli–altri abilitato–ad–esercitare–la–professione a–poco–a–poco–si v–v–i–s–o un–non–so–che–di

1900 per–la–prima–volta a–poco–a–poco dal–punto–di–vista del–consiglio–di–stato si–può–dire–che sotto–il–nome–di ha–facoltà–di–parlare di–grazia–e–giustizia di–previsione–della–spesa della–spesa–del–ministero del–disegno–di–legge previsione–della–spesa–del un–gran–numero–di stato–di–previsione–della in–questi–ultimi–anni un–certo–numero–di dello–stato–di–previsione il–disegno–di–legge ministro–dei–lavori–pubblici il–più–delle–volte nel–caso–in–cui in–tutta–la–sua per–ciò–che–riguarda che–non–si–può legge–comunale–e–provinciale una–vera–e–propria segretario–di–stato–per in–tutti–i–casi ministero–dei–lavori–pubblici che–si–tratti–di

2000 dal–punto–di–vista per–la–prima–volta nel–momento–in–cui una–vera–e–propria un–vero–e–proprio un–punto–di–vista di–volta–in–volta è–in–grado–di di–cui–al–comma nel–caso–in–cui questo–punto–di–vista si–tratta–di–un a–che–fare–con si–tratta–di–un da–un–punto–di al–di–là–di nella–misura–in–cui dal–punto–di–vista si–tratta–di–una nella–seconda–metà–del in–un–certo–senso come–si–è–detto si–tratta–di–una del–presidente–della–repubblica come–si–è–visto di–una–serie–di al–di–là–della il–modo–in–cui si–può–dire–che per–quanto–riguarda–la

1900 previsione–della–spesa–del–ministero di–previsione–della–spesa–del stato–di–previsione–della–spesa dello–stato–di–previsione–della disegni–di–legge–e–relazioni cassa–dei–depositi–e–prestiti della–legge–comunale–e–provinciale sezione–del–consiglio–di–stato in–tutto–o–in–parte delle–leggi–e–dei–decreti ufficiale–delle–leggi–e–dei ministro–segretario–di–stato–per leggi–e–dei–decreti–del mandando–a–chiunque–spetti–di ministro–di–grazia–e–giustizia del–fondo–per–il–culto e–dei–decreti–del–regno atti–del–governo–a–f dal–punto–di–vista–della raccolta–ufficiale–delle–leggi–e nella–gazzetta–ufficiale–del–regno ordiniamo–che–il–presente–decreto gli–uni–e–gli–altri nella–raccolta–ufficiale–delle–leggi nella–maggior–parte–dei–casi osservarlo–e–di–farlo–osservare iv–sezione–del–consiglio–di munito–del–sigillo–dello–stato delle–poste–e–dei–telegrafi di–grazia–e–giustizia–e

2000 da–un–punto–di–vista decreto–del–presidente–della–repubblica da–questo–punto–di–vista dal–punto–di–vista–della data–di–entrata–in–vigore da–questo–punto–di–vista nella–maggior–parte–dei–casi presidente–del–consiglio–dei–ministri per–la–prima–volta–in non–è–in–grado–di di–una–vera–e–propria del–presidente–del–consiglio–dei dal–punto–di–vista–del di–un–vero–e–proprio sia–dal–punto–di–vista di–entrata–in–vigore–della del–decreto–del–presidente–della da–un–punto–di–vista la–questione–di–legittimità–costituzionale per–la–prima–volta–nel presidenza–del–consiglio–dei–ministri in–tutto–o–in–parte non–è–un–caso–che a–che–fare–con–la di–stampare–nel–mese–di nel–momento–in–cui–si ha–a–che–fare–con finito–di–stampare–nel–mese anche–dal–punto–di–vista dalla–data–di–entrata–in

Table S20. Top Italian 5-grams for arbitrary years.

16

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 и день на въ то а въ не города го пошли изъ до во по капитанъ по й и утру пришли полки того къ ь при ночь была что также

1800 и на не что въ по въ его а для то но или за было до же только при о къ они того были бы какъ во я у отъ

1900 и въ на не что по его для какъ а къ но за о или то же при только было отъ изъ это до я такъ у бы время въ

2000 и в на не что по к как в из а о его от для он за же это я или но то у и их было только до был

Table S21. Top Russian 1-grams for arbitrary years.

Frontiers

17

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 по–утру й–день изъ–города также–и во–вторникъ го–года въ–неделю того–же однако–жъ на–день въ–вечеру была–изъ шли–во чинить–не февраля–въ сентября–день пришли–и пришелъ–и пргьхали–на пошли–на пошли–и пошелъ–на потому–что полки–пришли по–городу переменя–лошадей октября–день но–только не–могли не–дошедъ

1800 и–не и–на потому–что и–по во–время и–въ не–было и–въ не–только что–на для–того что–они но–и и–что и–проч что–не а–не что–въ то–же и–для что–и на–него и–при то–время можно–было еще–не но–не на–обороть какъ–и не–можетъ

1900 и–въ и–не и–на не–только такъ–какъ можетъ–быть но–и что–въ такъ–и не–было и–по потому–что а–также не–можетъ какъ–и во–время что–онъ а–не и–что и–его въ–томъ о–томъ то–же и–для у–него въ–то еще–не и–даже на–то что–и

2000 и–в и–не в–том а–также не–только что–в но–и так–и может–быть о–том и–на в–этом не–было что–он и–его как–и из–них а–в не–может так–как таким–образом а–не потому–что как–бы не–в у–него то–же в–его не–менее во–время

Table S22. Top Russian 2-grams for arbitrary years.

18

Supplementary Material Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 была–изъ–города чинить–не–могли по–городу–бить не–дошедъ–до хотя–они–отъ только–два–человека того–же–имени также–и–про также–и–день реку–на–другую пушками–и–мортирами путь–и–шли пришли–къ–городу пошли–на–море пошли–въ–путь потомъ–же–и потому–что–онъ полки–перешли–на поел–ь–того подъ–его–командою по–утру–рано по–а–что перешли–на–другую переименовали–его–въ они–только–въ ночь–на–день но–только–два ничего–не–в никакого–вреда–его не–токмо–что

1800 въ–то–время то–же–время на–другой–день на–другой–день мало–по–малу бы–то–ни и–того–ради не–только–не то–ни–было не–взирая–на въ–то–же и–для–того въ–это–время не–иное–что по–крайней–мере не–можетъ–быть и–во–время въ–это–время преподобнаго–отца–нашего не–можно–было такъ–какъ–и а–на–обороть хотя–и–не по–тому–что императора–петра–великаго и–при–томъ въ–то–время равно–какъ–и не–что–иное на–что–на

1900 то–же–время въ–то–время не–можетъ–быть въ–то–же бы–то–ни то–ни–было по–крайней–мере въ–это–время не–только–не а–также–и не–только–въ такъ–и–въ и–въ–то не–могутъ–быть хотя–и–не можно–было–бы равно–какъ–и какъ–и–въ но–и–въ а–равно–и въ–это–время къ–тому–же въ–то–время и–то–же въ–первый–разъ на–этотъ–разъ одно–и–то и–такимъ–образомъ на–другой–день по–этому–поводу

2000 то–же–время по–отношению–к в–то–время в–первую–очередь не–может–быть в–то–же по–крайней–мере в–то–же в–отличие–от не–только–в тем–не–менее тем–не–менее так–и–не в–это–время так–и–в то–время–как бы–то–ни как–и–в той–или–иной в–конце–концов а–также–в к–тому–же к–тому–же но–и–в думы–федерального–собрания в–отличие–от речь–идет–о можно–было–бы в–том–же то–ни–было

Table S23. Top Russian 3-grams for arbitrary years.

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1800 въ–то–же–время бы–то–ни–было не–взирая–на–то что–на–что–на во–что–бы–то какъ–бы–то–ни что–въ–то–время что–бы–то–ни въ–то–же–время одно–и–то–же и–въ–то–же что–въ–это–время то–же–время–и но–въ–то–же не–задолго–до–того не–говоря–уже–о по–одному–или–по по–два–и–по одному–или–по–два одной–и–той–же не–взирая–ни–на не–было–ни–одного на–что–на–что на–роды–и–виды и–въ–то–время два–и–по–три въ–то–же–время въ–то–время–въ узнать–было–не–можно то–время–не–было

1900 въ–то–же–время бы–то–ни–было и–въ–то–же одно–и–то–же одной–и–той–же одного–и–того–же въ–то–же–время что–бы–то–ни какъ–бы–то–ни во–что–бы–то не–говоря–уже–о но–въ–то–же то–же–время–и въ–то–время–какъ и–то–же–время отъ–времени–до–времени и–въ–тоже–время и–не–можетъ–быть какой–бы–то–ни ни–за–что–не въ–одно–и–то что–въ–то–время не–можетъ–быть–и и–на–этотъ–разъ не–говоря–уже–о такъ–же–какъ–и одну–и–ту–же одному–и–тому–же говоря–уже–о–томъ какихъ–бы–то–ни

2000 в–то–же–время в–то–же–время в–то–время–как бы–то–ни–было изменений–и–дополнений–в и–в–то–же одно–и–то–же в–той–или–иной совета–федерации–федерального–собрания о–проекте–федерального–закона так–же–как–и ссср–по–обвинению–в ни–в–чем–не вквс–ссср–по–обвинению федерального–собрания–российской–федерации один–и–тот–же не–говоря–уже–о одного–и–того–же одной–и–той–же что–бы–то–ни дополнений–в–федеральный–закон одни–и–те–же во–что–бы–то государственной–думы–федерального–собрания думы–федерального–собрания–российской но–в–то–же и–дополнений–в–федеральный и–тем–не–менее постановление–государственной–думы–федерального то–же–время–в

Table S24. Top Russian 4-grams for arbitrary years.

Frontiers

19

Supplementary Material Rank 1800 1900 1 какъ–бы–то–ни–было и–въ–то–же–время 2 во–что–бы–то–ни какъ–бы–то–ни–было 3 и–въ–то–же–время во–что–бы–то–ни 4 но–въ–то–же–время но–въ–то–же–время 5 по–одному–или–по–два одно–и–то–же–время 6 по–два–и–по–три какой–бы–то–ни–было 7 кому–бы–то–ни–было въ–одно–и–то–же 8 какого–бы–то–ни–было въ–то–же–время–и 9 такъ–какъ–въ–то–время какихъ–бы–то–ни–было 10 одного–берега–озера–на–другой какого–бы–то–ни–было 11 на–три–и–на–четыре въ–одной–и–той–же 12 на–два–и–на–три въ–то–же–время–не 13 которой–не–было–еще–и какимъ–бы–то–ни–было 14 какой–бы–то–ни–было не–говоря–уже–о–томъ 15 во–что–бы–то–пи но–въ–то–же–время 16 что–она–не–только–не какъ–бы–то–ни–было 17 что–ни–одна–птица–не не–говоря–уже–о–томъ 18 что–въ–то–время–въ не–можетъ–быть–и–речи 19 потому–что–она–не–только кого–бы–то–ни–было 20 потому–что–въ–это–время къ–одному–и–тому–же 21 потому–что–въ–то–время на–одной–и–той–же 22 по–три–и–по–четыре на–одномъ–и–томъ–же 23 одно–и–то–же–время какое–бы–то–ни–было 24 но–не–взирая–на–то какомъ–бы–то–ни–было 25 но–и–то–и–другое не–одно–и–то–же 26 но–и–между–ними–и кому–бы–то–ни–было 27 ни–у–кого–не–брать какими–бы–то–ни–было 28 ни–для–кого–не–тайна по–одному–и–тому–же 29 не–говоря–уже–о–томъ въ–одномъ–и–томъ–же 30 на–три–года–и–одиннадцать къ–одной–и–той–же Table S25. Top Russian 5-grams for arbitrary years.

2000 и–в–то–же–время вквс–ссср–по–обвинению–в во–что–бы–то–ни государственной–думы–федерального–собрания–российской думы–федерального–собрания–российской–федерации и–дополнений–в–федеральный–закон изменений–и–дополнений–в–федеральный но–в–то–же–время постановление–государственной–думы–федерального–собрания изменений–и–дополнений–в–закон гд–постановление–государственной–думы–федерального в–то–же–время–в ш–гд–постановление–государственной–думы в–то–время–как–в как–бы–то–ни–было р–распоряжение–правительства–российской–федерации в–той–или–иной–мере каких–бы–то–ни–было какой–бы–то–ни–было а–в–дву–по–тому в–дву–по–тому–ж федерации–федерального–собрания–российской–федерации совета–федерации–федерального–собрания–российской и–в–то–же–время постановление–совета–федерации–федерального–собрания но–в–то–же–время в–одном–и–том–же ни–в–чем–не–бывало ни–в–коей–мере–не как–ни–в–чем–не

20

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de que y la el en los con á se no por del las lo para un su le es a al fe mas como una fu ha me tan

1800 de que y la en el los se á las del por con no su es para mas lo un al una como sus ó este esta le si ha

1900 de la que y el en los á del se las por no con para su al un lo una es como a ó sus el ha i más la

2000 de la y que en el a los del se las por un con una no para su es al como lo o más la el sus en sobre entre

Table S26. Top Spanish 1-grams for arbitrary years.

Frontiers

21

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de–la de–los en–la en–el lo–que que–se que–no de–las de–que con–el de–su que–le á–la que–fe en–que que–el que–en que–la y–que y–el que–es para–que por–la con–la á–los por–el y–de y–en en–los en–las

1800 de–la de–los que–se en–el de–las en–la á–la de–su lo–que y–de á–los que–no que–el que–la en–los y–que y–la en–que que–en de–que y–el en–las que–los todos–los de–sus no–se y–en los–que con–la con–el

1900 de–la de–los en–el que–se en–la de–las á–la por–el que–el á–los que–no de–su que–la lo–que por–la y–de en–los de–que en–que de–un y–el en–las y–la con–el que–en y–en en–su y–que con–la de–una

2000 de–la de–los en–el en–la a–la de–las que–se a–los y–la lo–que de–un y–el en–los que–la que–el de–su de–una y–de por–el con–el por–la con–la en–las que–no a–las de–que en–su en–un de–sus que–los

Table S27. Top Spanish 2-grams for arbitrary years.

22

Supplementary Material

Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 de–lo–que uno–de–los a–que–respondió que–fe–le de–que–se por–lo–que de–los–que de–la–real en–que–se á–que–respondió rey–nueftro–señor que–en–el mil–ducados–de lo–que–se en–lo–que el–duque–de que–se–le de–san–joseph y–de–la una–porcion–de se–ha–de la–iglesia–de veinte–y–quatro que–en–la iglesia–de–san de–que–no que–es–el lo–que–fe en–la–llama de–san–luis

1800 y–de–la en–que–se que–no–se de–lo–que de–los–que todo–lo–que de–todos–los lo–que–se por–lo–que por–medio–de y–de–los de–que–se á–los–que de–todas–las que–en–el de–la–tierra al–mismo–tiempo uno–de–los se–ha–de de–los–hombres que–se–ha el–nombre–de de–la–vida de–la–naturaleza y–en–el la–mayor–parte y–de–las y–en–la que–es–el que–en–la

1900 de–la–república en–que–se de–la–ley los–estados–unidos y–de–la que–no–se uno–de–los de–que–se de–buenos–aires presidente–de–la de–lo–que lo–que–se que–se–le la–ley–de la–ciudad–de de–los–que por–medio–de de–la–misma de–los–estados que–en–el de–la–ciudad de–todos–los á–fin–de á–que–se una–de–las y–en–el y–de–los que–se–ha la–comisión–de y–en–la

2000 a–través–de uno–de–los y–de–la a–partir–de de–lo–que en–el–que el–caso–de una–de–las lo–que–se en–la–que de–la–república de–la–vida por–lo–que que–no–se de–la–población de–la–sociedad de–la–ciudad y–a–la en–que–se parte–de–la a–pesar–de la–necesidad–de y–en–la y–en–el la–mayoría–de que–en–el los–estados–unidos de–que–el la–ciudad–de en–el–caso

Table S28. Top Spanish 3-grams for arbitrary years.

Frontiers

23

Supplementary Material Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

1700 la–iglesia–de–san iglesia–de–san–joseph del–rey–nueftro–señor veinte–y–quatro–mil el–papel–azul–de á–un–fuego–fuerte y–quatro–mil–ducados en–la–forma–que un–baño–de–arena los–veinte–y–quatro en–un–baño–de en–la–llama–interior por–la–via–seca por–el–ácido–nitroso la–reyna–nueftra–señora la–de–san–luis en–un–crisol–de de–los–veinte–y se–disuelve–en–el que–es–lo–que por–medio–del–soplete mufla–de–un–horno mil–ducados–de–ayuda mas–claro–que–el la–mufla–de–un en–la–casa–del el–hierro–y–la ducados–de–ayuda–de de–un–horno–de de–la–real–hazienda

1800 la–mayor–parte–de á–fin–de–que por–medio–de–la mayor–parte–de–los con–el–nombre–de al–mismo–tiempo–que cada–uno–de–los que–se–ha–de de–lo–que–se y–al–mismo–tiempo de–todo–lo–que no–es–mas–que cada–una–de–las es–lo–mismo–que no–es–otra–cosa que–se–han–de del–mismo–modo–que el–modo–con–que de–los–que–se es–uno–de–los para–que–no–se es–una–de–las mayor–parte–de–las circulacion–de–la–sangre en–el–año–de todo–lo–que–se con–el–fin–de en–el–caso–de la–circulacion–de–la lo–que–se–ha

1900 presidente–de–la–república de–los–estados–unidos de–la–ley–de el–presidente–de–la á–fin–de–que en–el–caso–de á–que–se–refiere la–mayor–parte–de de–la–ciudad–de lo–dispuesto–en–el cada–uno–de–los que–se–refiere–el de–que–se–trata y–dése–al–registro de–la–provincia–de dése–al–registro–nacional de–la–ley–de en–la–ciudad–de dicho–francisco–de–villagra en–los–estados–unidos con–el–objeto–de de–acuerdo–con–el el–caso–de–que con–el–nombre–de de–la–comisión–de días–del–mes–de cada–una–de–las la–ciudad–de–santiago de–mil–ochocientos–noventa mil–ochocientos–noventa–y

2000 a–través–de–la a–lo–largo–de en–el–caso–de el–hecho–de–que el–punto–de–vista a–partir–de–la la–mayoría–de–los de–los–estados–unidos con–el–fin–de en–el–que–se cada–uno–de–los la–mayor–parte–de de–la–universidad–de desde–el–punto–de en–la–ciudad–de la–década–de–los de–la–ciudad–de en–la–que–se en–el–caso–de la–medida–en–que en–el–sentido–de a–través–de–los cada–una–de–las el–caso–de–la a–la–hora–de por–parte–de–los de–la–ley–de a–lo–largo–del en–la–medida–en en–el–marco–de

Table S29. Top Spanish 4-grams for arbitrary years.

Rank 1700 1800 1 veinte–y–quatro–mil–ducados la–mayor–parte–de–los 2 los–veinte–y–quatro–mil la–circulacion–de–la–sangre 3 en–un–baño–de–arena la–mayor–parte–de–las 4 mufla–de–un–horno–de no–es–otra–cosa–que 5 mil–ducados–de–ayuda–de que–la–mayor–parte–de 6 la–mufla–de–un–horno por–lo–que–hace–á 7 se–disuelve–en–el–agua por–la–gracia–de–dios 8 de–los–veinte–y–quatro de–la–mayor–parte–de 9 un–sabor–dulce–al–principio no–es–mas–que–un 10 mas–claro–que–el–del de–mil–setecientos–treinta–y 11 la–octava–parte–de–la la–superficie–de–la–tierra 12 la–mayor–parte–de–los todas–las–partes–del–cuerpo 13 es–semejante–á–la–que no–es–mas–que–una 14 en–las–paredes–del–vaso en–la–mayor–parte–de 15 en–la–mufla–de–un en–el–tiempo–de–la 16 en–la–iglesia–de–san los–siglos–de–los–siglos 17 con–el–agua–de–cal la–mayor–parte–de–la 18 que–quedó–sobre–el–filtro lo–que–hace–á–la 19 que–es–una–mina–de la–parte–superior–de–la 20 poco–ó–nada–lo–que por–lo–que–toca–á 21 para–la–administracion–de–los de–que–acabamos–de–hablar 22 las–paredes–del–vaso–una y–esto–es–lo–que 23 la–sala–de–alcaldes–de rey–de–la–gran–bretaña 24 en–la–iglesia–parroquial–de á–cada–uno–de–los 25 de–qualquier–calidad–que–sean por–lo–que–toca–á 26 de–nuestra–señora–de–la de–la–academia–de–las 27 de–mil–setecientos–cincuenta–y de–cada–una–de–las 28 de–la–iglesia–de–san por–lo–que–hace–á 29 de–esta–villa–de–madrid de–la–circulacion–de–la 30 a–los–reales–pies–de como–se–puede–ver–en Table S30. Top Spanish 5-grams for arbitrary years.

1900 el–presidente–de–la–república á–que–se–refiere–el y–dése–al–registro–nacional de–mil–ochocientos–noventa–y el–dicho–francisco–de–villagra los–señores–por–la–afirmativa dirección–de–tierras–y–colonias la–dirección–de–tierras–y publiquese–y–dése–al–registro la–mayor–parte–de–los el–presidente–de–la–república en–el–caso–de–que la–administración–general–del–estado á–la–dirección–de–tierras que–se–refiere–el–artículo publíquese–y–dése–al–registro lo–dispuesto–en–el–artículo la–ciudad–de–buenos–aires de–estado–y–del–despacho lo–dispuesto–en–el–art estado–y–del–despacho–de de–la–ley–de–enjuiciamiento los–estados–unidos–de–américa de–los–estados–unidos–de servicio–de–dios–y–de desde–el–punto–de–vista la–mayor–parte–de–las han–de–uso–y–costumbre de–casación–por–infracción–de de–la–cámara–de–diputados

2000 desde–el–punto–de–vista en–la–medida–en–que el–punto–de–vista–de a–lo–largo–de–la en–el–sentido–de–que lo–que–se–refiere–a en–la–mayoría–de–los diario–oficial–de–la–federación la–segunda–mitad–del–siglo de–mil–novecientos–noventa–y la–mayor–parte–de–los de–la–década–de–los en–el–diario–oficial–de de–los–estados–unidos–mexicanos el–diario–oficial–de–la en–el–campo–de–la en–el–caso–de–la desde–el–punto–de–vista en–el–caso–de–los a–que–se–refiere–el en–el–marco–de–la américa–latina–y–el–caribe de–cada–uno–de–los la–mayoría–de–los–casos en–lo–que–se–refiere en–la–década–de–los en–el–ámbito–de–la a–lo–largo–de–los política–de–los–estados–unidos constitución–política–de–los–estados

24