Masaryk University Faculty of Informatics
Automatic methods for detection of word usage over time
Bachelor's thesis

Ondřej Herman

Brno, Spring 2013
Declaration

Hereby I declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Ondřej Herman

Advisor: RNDr. Vojtěch Kovář
Abstract

From a natural language corpus, word usage data over time can be extracted. To detect and quantify change in this data, automatic procedures can be employed. In this work, the theory of ordinary and robust regression methods is discussed and applied to real-world data with great success. A Python implementation is included. Smoothing of time series and detection of seasonality are examined, but ultimately this path does not seem to give satisfactory results for the data explored.
Keywords

corpus linguistics, time series, regression, trend, smoothing, periodicity detection
Contents

1 Introduction
  1.1 Conventions and notation
2 Corpora
  2.1 British National Corpus
  2.2 Oxford English Corpus
  2.3 Google Books n-gram Corpus
3 Time series analysis
  3.1 Classical time series decomposition
  3.2 Neoclassical time series decomposition
  3.3 Frequency domain approach
4 Regression analysis
  4.1 Simple linear regression
    4.1.1 Finding the least-squares fit
  4.2 Statistical hypothesis testing
  4.3 Testing significance of a regression model
  4.4 F-test
  4.5 t-test
  4.6 Weighted linear regression
    4.6.1 Choosing the weights
    4.6.2 F-test
    4.6.3 The coefficient of determination, R²
    4.6.4 Adjusted R²
5 Robust regression
  5.1 Theil-Sen estimator
  5.2 Moore-Wallis Test
  5.3 Mann-Kendall test
  5.4 Spearman's ρ
6 Smoothing
  6.1 Moving average filters
    6.1.1 Simple moving averages
    6.1.2 Weighted moving averages
  6.2 Exponential moving average
    6.2.1 Double exponential smoothing
  6.3 Median filters
7 Analysis of seasonality
  7.1 Periodical behavior estimation
    7.1.1 Periodogram
    7.1.2 Welch's method
    7.1.3 R.A. Fisher's test
8 Implementation
  8.1 Requirements
  8.2 Configuration
  8.3 Scripts
    8.3.1 freq_analyzer.py
    8.3.2 freq_sort.py
  8.4 Modules
    8.4.1 linreg.py
    8.4.2 wlreg.py
    8.4.3 theilsen.py
    8.4.4 diffsign.py, spearman.py, mannkendall.py
    8.4.5 reader.py
9 Evaluation
  9.1 Slope estimation
    9.1.1 Linear regression
    9.1.2 Theil-Sen estimator
    9.1.3 Normalizing the slope
  9.2 Changepoint detection
  9.3 Smoothing
  9.4 Periodicity detection
10 Conclusion
11 Bibliography
A Preparing the Google n-gram datasets
B Example data
1 Introduction

Human language is a complex, continuously evolving phenomenon. The change is partly driven by the necessity to adequately represent the environment we spend our lives in; however, this is not the full picture. Various outdated words are kept in use to describe contemporary ideas (such as the verb to sail, which is now used to refer to travel by any ship), while others seem to undergo change for no apparent reason (car instead of auto)[13]. The fact that language can change stems from its arbitrariness and conventionality, but whether such change is necessary, and what the true nature of such development is, remains to be discovered.

In the past, linguists used to characterize languages based on their own experience and introspection. This methodology can only reflect the nature of an idealized, subjective model, which is inherently frozen in time, unlike the empirical reality of an everyday speech act. The recent development of large corpora allows us to have a convenient and easily quantifiable view of language change based on actual data. Unfortunately, a corpus will always be plagued by unavoidable transcription and systematic errors in its construction, and the fact that a language is an infinite object means that a corpus can never give a complete and impartial perspective of the language. However, as the size of a corpus increases, this bias becomes less pronounced.
Figure 1.1: Plot of the logistic function
The usage of an isolated language feature over time can be modelled as a sigmoid function[4]. At first, the change is slow, but as the feature spreads through the population the rate of change rises, and then eventually begins to slow down again as the usage approaches saturation. Various functions that produce this shape exist – perhaps the most commonly used one is the logistic function P(t) = 1/(1 + e^(−t)), which is shown in figure 1.1.

Many different processes influence the frequency of usage of a single word at the same time, so fitting a single sigmoid curve to real world data is in practice almost never possible. However, most of the time it is safe to assume that the change in usage does not occur abruptly, but that its magnitude is locally smooth and approximately linear.

Words tend to appear in groups, or bursts[15]. The specific nature of this grouping depends on various factors. For example, repeating a word in written language is considered stylistically undesirable, whereas in speech this is not necessarily so. Similarly, proper nouns are an important category of words that tend to appear in isolated bursts. For commonly used words, the cumulative frequencies are large enough that this burstiness is diluted, and the distribution of errors becomes approximately normal, as implied by the central limit theorem. On the other hand, for rare words it causes problems, as it is not possible to reliably decide whether the apparent high usage frequency in one time period is caused by the appearance of a random burst, or whether the usage of the word actually became higher.
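As a small illustration of the logistic curve from figure 1.1, the following Python sketch (not part of the thesis implementation, with arbitrary sample points) evaluates P(t) over the plotted interval:

```python
import numpy as np

def logistic(t):
    """Logistic function P(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Evaluate the curve on an interval similar to the one shown in figure 1.1.
t = np.linspace(-6, 6, 13)
for ti, pi in zip(t, logistic(t)):
    print(f"t = {ti:+.1f}  P(t) = {pi:.3f}")
```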
1.1 Conventions and notation

ā               mean of a_1, ..., a_n
b̂               estimated value of b
c′               empirically determined value of c
u_α             α-quantile of the standard normal distribution
F_{α,d_1,d_2}   α-quantile of the F distribution with d_1 and d_2 degrees of freedom
t_{α,d}         α-quantile of the Student's t distribution with d degrees of freedom
Φ               the standard normal cumulative distribution function
I               identity matrix
M^T             transpose of the matrix M
N^{−1}          inverse of the matrix N
2 Corpora

2.1 British National Corpus

The British National Corpus (BNC)[26] is a general purpose corpus of the British variant of the English language. It consists of 100 million words from 4049 different text samples[5]. These samples vary in size, but most of them are approximately 40,000 words long. However, it is designed as a synchronic corpus, characterizing the language at a specific point in time, so the composition is not intentionally kept homogeneous over time, and the vast majority of samples carry dates from the latter years of the sampling period[3]. This imbalance introduces a significant bias, mainly for rare words.
2.2 Oxford English Corpus

The main source for the Oxford English Corpus (OEC) is a diverse selection of sites on the World Wide Web[27]. As of 2011, the corpus consists of approximately 2 billion words, and is being updated regularly. It is considerably better balanced than the BNC, and provides a rich representation of the written language.
Figure 2.1: Total yearly word counts. (a) British National Corpus, (b) Oxford English Corpus, (c) Google ngrams.
2.3 Google Books n-gram Corpus

Perhaps the most comprehensive source of word usage data are the Google n-gram datasets[9]. The corpus from which the datasets are derived contains about 4% of all books ever printed[18]. However, the corpus itself is not available due to copyright reasons. Because of this, only the resulting aggregate yearly counts are available for sequences of up to five words, so only simple literal queries can be executed. The datasets are available in a few different variants: American and British English, French, Spanish, German, simplified Chinese, Russian and Hebrew.

The best data is from the period between 1800 and 2000. This is because the total number of texts before the year 1800 is small, and after the year 2000 the structure of the corpus changed slightly, so these latter data points cannot be reliably compared to the older ones. Even though the corpus contains copious amounts of OCR errors, the sheer size of the data more than makes up for this deficiency.
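As a rough illustration of how yearly counts can be aggregated from such a dataset, the sketch below reads one of the published n-gram files. The assumed tab-separated column layout (n-gram, year, match count, volume count) and the file name are illustrative assumptions only; the exact layout differs between dataset versions.

```python
import csv
from collections import defaultdict

def yearly_counts(path, word):
    """Sum the yearly match counts of a single 1-gram from a TSV dataset file.

    Assumed row layout: ngram <TAB> year <TAB> match_count <TAB> volume_count.
    """
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for ngram, year, match_count, _volume_count in csv.reader(f, delimiter="\t"):
            if ngram == word:
                counts[int(year)] += int(match_count)
    return dict(sorted(counts.items()))

# Hypothetical usage (file name is illustrative):
# print(yearly_counts("googlebooks-eng-all-1gram-20120701-c", "carrot"))
```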
3 Time series analysis

A time series is a sequence of discretely spaced observations (x_i, y_i), where y_i is the observation for the time period x_i. The basic approaches to time series analysis can be classified as follows[8]:

1. Time domain approach
   (a) classical time series decomposition, based on regression analysis
   (b) neoclassical time series decomposition, based on correlation analysis

2. Frequency domain approach
3.1 Classical time series decomposition

This method operates under the assumption that a time series only depends on time, and can be separated into mutually independent deterministic and indeterministic components

y = t + s + e¹

where t is a trend component, s a possible seasonal component and e consists of random fluctuations.
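A minimal sketch of such an additive decomposition, assuming a recent version of the statsmodels library and a synthetic monthly sampled series (the yearly word-frequency series in this thesis would usually have no seasonal part):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic example: linear trend + yearly cycle + noise, sampled monthly.
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=120, freq="MS")
y = pd.Series(0.05 * np.arange(120)
              + np.sin(2 * np.pi * np.arange(120) / 12)
              + rng.normal(scale=0.2, size=120), index=index)

# y = t + s + e: estimate trend, seasonal and residual components.
parts = seasonal_decompose(y, model="additive", period=12)
print(parts.trend.dropna().head())
print(parts.seasonal.head(12))
```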
3.2 Neoclassical time series decomposition

Also known as the Box-Jenkins methodology, the modern approach to time series modelling is based on fitting ARIMA² or related models to the observations. Conceptually similar to smoothers³, these models consider every component of the series to be a manifestation of some unknown, likely autocorrelated, stochastic process.

1. A multiplicative model y = tse can easily be converted to an equivalent additive model by taking the log of the series.
2. Autoregressive Integrated Moving Average
3. An ARIMA(0, 0, 1) model is equivalent to a simple moving average.
Usually fitted to a time series after removing the trend and seasonal components, the obtained models provide possibly the best known solutions for many forecasting problems. Choosing and estimating the parameters to obtain a reasonable model necessitates human intervention and at least 50, but preferably 100, reliable observations[19], so they are not useful for automatic off-line processing. The models obtained using the Box-Jenkins methodology also do not provide a convenient way to quantify their properties.
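For completeness, a minimal sketch of fitting such a model with a recent statsmodels version; the order (1, 1, 1) is an arbitrary illustrative choice, not one taken from the thesis:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic autocorrelated series standing in for a word-frequency series.
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=200)) + 0.1 * np.arange(200)

# Fit an ARIMA(p, d, q) model and forecast a few steps ahead.
fit = ARIMA(y, order=(1, 1, 1)).fit()
print(fit.params)
print(fit.forecast(steps=5))
```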
3.3 Frequency domain approach

These methods, which inspect the spectral properties of a time series, are based on Fourier analysis, which assumes the series to be a composition of sine waves of various frequencies and phases. The aim of these methods is to characterize the cyclic behaviour of the series. Even though word frequency time series sampled yearly virtually never show significant periodicity, such behaviour is very common in more densely sampled datasets.
4 Regression analysis

The aim of regression analysis is to investigate a possible relationship between two or more variables. In this case the variables are the word usage frequency y_i and time x_i.
4.1 Simple linear regression

In simple linear regression, it is assumed that the true relationship between two variables, x and y, is linear:

y_i = a + b x_i    (4.1)

where a, the intercept, and b, the slope, are unknown constants we are trying to estimate. The values of y_i are not exactly known¹:

y′_i = y_i + e_i    (4.2)

where e is an unpredictable error component.

4.1.1 Finding the least-squares fit

This solution to finding the regression coefficients is based on minimizing the sum of squared errors. While there are other methods to solve this problem, such as the least absolute deviations, the method of least squares has a few desirable properties: the equations have a unique analytical solution, the derivation is straightforward and the method is widely used and well studied.

To estimate the values of a and b from a set of n observations (x_1, y′_1), (x_2, y′_2), ..., (x_n, y′_n), the method of least squares[19, 8] can be used. That is, â and b̂ such that the sum of squared errors e is minimal are to be found:

e_i = y′_i − b̂ x_i − â    (4.3)

e = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y′_i − b̂ x_i − â)²    (4.4)

1. On the other hand, it is assumed that the values of x_i are exactly known. Errors-in-variables models do away with this assumption.
This happens at the unique point where both of the partial derivatives of e with respect to â and b̂ vanish:

∂e/∂â = Σ_{i=1}^{n} 2 e_i ∂e_i/∂â = −2 Σ_{i=1}^{n} (y′_i − b̂ x_i − â) = 0    (4.5)

∂e/∂b̂ = Σ_{i=1}^{n} 2 e_i ∂e_i/∂b̂ = −2 Σ_{i=1}^{n} (y′_i − b̂ x_i − â) x_i = 0    (4.6)

Rearranging these equations yields

â n + b̂ Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} y′_i    (4.7)

â Σ_{i=1}^{n} x_i + b̂ Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} y′_i x_i    (4.8)

The equations (4.7) and (4.8) are known as the least-squares normal equations, and the solutions for â and b̂ are[19]

b̂ = Σ_{i=1}^{n} (y′_i − ȳ′)(x_i − x̄) / Σ_{i=1}^{n} (x_i − x̄)² = S_yx / S_xx    (4.9)

â = ȳ′ − b̂ x̄    (4.10)

where x̄ = (1/n) Σ_{i=1}^{n} x_i and ȳ′ = (1/n) Σ_{i=1}^{n} y′_i.
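A minimal Python sketch of this closed-form solution, illustrating equations (4.9) and (4.10); it is not necessarily identical to the thesis's linreg.py module, and the example data is synthetic:

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares estimates of the intercept a and slope b in y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    s_yx = np.sum((y - y_mean) * (x - x_mean))    # S_yx
    s_xx = np.sum((x - x_mean) ** 2)              # S_xx
    b_hat = s_yx / s_xx                           # equation (4.9)
    a_hat = y_mean - b_hat * x_mean               # equation (4.10)
    return a_hat, b_hat

# Example: a noisy increasing series.
x = np.arange(1990, 2010)
y = 0.3 * (x - 1990) + np.random.default_rng(0).normal(scale=0.5, size=x.size)
print(simple_linear_regression(x, y))
```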
4.2 Statistical hypothesis testing

The tests described in the following chapters are based on the standard statistical framework. They exploit the fact that if a tested object has some property P, the distribution of the values of a test statistic Z is known. Additionally, how to calculate Z is known, so an educated guess can be made whether or not P holds.

For illustration, when Z is normally distributed with mean zero and unit variance, its value will, approximately 95 % of the time, take on a value from the interval (−1.96, 1.96). That is, the value will fall somewhere inside the hatched area in figure 4.1a, whose area is approximately 0.95. The p-value is the probability that the value will fall outside of this interval, or 0.05 in this case.

Figure 4.1: The standard normal distribution (with mean zero and unit variance). (a) probability density function, (b) cumulative distribution function.

If a specific value of the test statistic is known, the respective p-value can be determined using the cumulative distribution function. For Z, this function is the standard normal cumulative distribution function Φ, which is shown in figure 4.1b. p is obtained as[21]

p = 2(1 − Φ(|Z|))    (4.11)
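For illustration, a two-sided p-value as in equation (4.11) can be computed with SciPy; this is a generic sketch, not a component of the thesis's code:

```python
from scipy.stats import norm

def two_sided_p_value(z):
    """p = 2 * (1 - Phi(|Z|)) for a standard-normally distributed statistic Z."""
    return 2.0 * (1.0 - norm.cdf(abs(z)))

print(two_sided_p_value(1.96))  # approximately 0.05
print(two_sided_p_value(2.58))  # approximately 0.01
```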
For a given significance level α, the null hypothesis is rejected if p < α. An alternative (but equivalent) way of testing the significance level is finding a critical value and comparing it to the Z statistic, rejecting the null hypothesis when the value of |Z| is larger than the critical value. Some commonly used significance levels and their respective two-tailed² critical values, calculated as u_{1−α/2}, where u is the quantile function of the standard normal distribution, are:

α       critical value
0.1     1.64
0.05    1.96
0.01    2.58

2. The value of the statistic can diverge both ways relative to the mean.

Choosing a significance level

Because of the stochastic nature of statistics, there is always a chance that a hypothesis test will result in a wrong decision. When carrying
out a statistical test, the following outcomes can take place:

                  H0 rejected        H0 not rejected
H0 is true        Type I error       Correct outcome
H0 is false       Correct outcome    Type II error
The value of α approximately corresponds to the probability of committing a type I error. However, as α decreases, β, the probability of committing a type II error, will necessarily increase. It is therefore necessary to pick a suitable and balanced value depending on the particular requirements and the specific data on which the tests are carried out.
4.3 Testing significance of a regression model

Even though the estimated parameters â and b̂ are the best ones in the sense that they minimize the sum of squared errors, it is possible that the chosen model does not actually describe the observations well. Namely, it is desirable to ensure that the slope of the regression line b̂ is non-zero, and that its estimated value is significant compared to the noise. That is, the hypotheses to be tested are[19]

H_0: b̂ = 0    (4.12)
H_1: b̂ ≠ 0    (4.13)
4.4 F-test

The total sum of squares can be decomposed into the regression sum of squares SS_R and the residual sum of squares SS_E[19]:

S_yy = Σ_{i=1}^{n} (y_i − ȳ)² = SS_R + SS_E = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²    (4.14)

SS_R represents the variability explained by the regression model and SS_E the unexplained part. Using these values a test statistic for (4.12) can be constructed:
F_0 = SS_R / (SS_E / (n − 2))    (4.15)

Assuming that the null hypothesis holds, F_0 follows the F distribution with 1 and n − 2 degrees of freedom; therefore the series is considered to exhibit a statistically significant trend when |F_0| > F_{1−α,1,n−2} and the null hypothesis is rejected.
Figure 4.2: Linear regression models and their respective p-values obtained using the F-test, calculated for the word 'carrot'. (a) Oxford English Corpus, p = 0.5143; (b) Google ngrams, p = 4.3 × 10⁻¹⁰; (c) British National Corpus, p = 0.009.

The series shown in figure 4.2a does not show any evidence of a trend. On the other hand, the series in figure 4.2b shows a very significant trend. According to the result of the F-test, the series shown in figure 4.2c also exhibits a trend, but its steepness in this case seems to be caused by the limited volume of text contained in the early years sampled by the corpus and the resulting non-normality of the data.
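A sketch of this F-test for a simple linear fit, assuming SciPy for the F distribution; the slope and intercept are obtained from the closed-form estimator shown earlier, and the example data is synthetic:

```python
import numpy as np
from scipy.stats import f as f_dist

def regression_f_test(x, y):
    """F statistic (4.15) and p-value for the significance of a simple linear fit."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    b = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    ss_r = np.sum((y_hat - y.mean()) ** 2)   # explained variability
    ss_e = np.sum((y - y_hat) ** 2)          # residual variability
    f0 = ss_r / (ss_e / (n - 2))
    p = f_dist.sf(f0, 1, n - 2)              # P(F > F0) under the null hypothesis
    return f0, p

x = np.arange(1980, 2009)
y = 0.02 * (x - 1980) + np.random.default_rng(1).normal(scale=0.1, size=x.size)
print(regression_f_test(x, y))
```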
4.5 t-test

The t statistic of a regression fit is computed as[19]

t_0 = b̂ / √(V_ê / S_xx)    (4.16)

where V_ê is the residual variance, and S_xx = Σ_{i=1}^{n} (x_i − x̄)². In case the null hypothesis holds, t_0 follows the t-distribution with n − 2 degrees of freedom, so H_0 is to be rejected when |t_0| > t_{1−α/2,n−2} at the significance level α. In the case of a single regressor, this test yields exactly the same results as the F-test.

3. The unit of y is the logarithm of the relative frequency per million words, for the reasons explained in 4.6.4.
4.6 Weighted linear regression

The use of ordinary least squares assumes that the weight of every sample is the same. However, sometimes it is desirable for particular samples to have a stronger influence over the resulting model, for example when the data is heteroskedastic⁴. Given a diagonal matrix W = Iw, where w is the sequence of weights, the problem of finding the least squares estimator can be conveniently defined in matrix form[8, 19]. The model is

y_i = β_1 z_{i,1} + ... + β_d z_{i,d} + e_i    (4.17)

where z_{i,j} is the j-th regressor. For example, if we are trying to fit a polynomial to the data, z_{i,j} becomes x_i^{j−1}, and the model is

y_i = β_1 + x_i β_2 + x_i² β_3 + ... + x_i^{d−1} β_d

The model (4.17) can be rewritten as

Y = Zb + e    (4.18)

where

Y = (y_1, y_2, ..., y_n)^T,    b = (β_1, β_2, ..., β_d)^T,

and Z is the n × d matrix whose entry in row i and column j is z_{i,j}.

The least squares normal equations to find the estimator b̂ are

(Z^T W Z) b̂ = Z^T W y    (4.19)

and the solution is[19]

b̂ = (Z^T W Z)^{−1} (Z^T W y)    (4.20)

4. The sample variances are not equal.
Figure 4.3: Effect of the different weighting strategies shown on the word 'evil' from the British National Corpus. (a) W = total yearly count, (b) W = log(total yearly count).

However, the numerical precision of the matrix inverse is in many cases not very good, so using a linear solver directly on (4.19) is recommended.
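A minimal sketch of solving the weighted normal equations (4.19) directly with a linear solver, as recommended above; the design matrix degree and the weights below are illustrative choices, not the thesis's exact settings:

```python
import numpy as np

def weighted_least_squares(x, y, w, degree=1):
    """Solve (Z^T W Z) b = Z^T W y for a polynomial model of the given degree."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    Z = np.vander(x, N=degree + 1, increasing=True)
    W = np.diag(w)
    lhs = Z.T @ W @ Z
    rhs = Z.T @ W @ y
    return np.linalg.solve(lhs, rhs)   # more stable than forming the matrix inverse

x = np.arange(1965, 1995)
y = 1.0 + 0.05 * (x - 1965) + np.random.default_rng(2).normal(scale=0.3, size=x.size)
w = np.log(np.linspace(1e5, 1e6, x.size))   # e.g. log of total yearly word counts
print(weighted_least_squares(x, y, w, degree=1))
```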
4.6.1 Choosing the weights

Weighted least squares models require the sample variances to be known with respect to each other⁵, and the weight of each sample should be the reciprocal of its variance[19]. The actual sample variances of the word frequencies from the corpus are unknown and infeasible to calculate, but they seem to be smaller for years in which the total amount of words contained in the corpus is large. The total amounts of words examined in the given periods are known exactly, so their reciprocals can be used to approximate the variances. However, as shown in figure 4.3a, using the total amounts as the weights directly places disproportionately strong emphasis on the periods with a large amount of words. In figure 4.3b the log-transformed counts are used, with much better results.

5. This might seem to be more restrictive than the ordinary least squares methods, but those require all the sample variances to be equal.
4.6.2 F-test

Both the F-test and the t-test defined earlier can be extended for weighted models with multiple regressors, but their interpretation is not the same in this case: the t-test decides the significance of a single regressor, while the F-test decides the significance of the model as a whole. The significance of isolated regressors is not very interesting for our application, so only the F-test is described. The hypotheses tested by this version of the F-test are[28]

H_0: Y = Zβ_1 + e    (4.21)
H_1: Y = Zβ + e    (4.22)

that is, the null hypothesis is that the sample mean is a better predictor than the tested model. The F statistic is given as[6]

F_d = (SS_R / (d − 1)) / (SS_E / (n − d))    (4.23)
where SS_R is the regression sum of squares, SS_E is the residual sum of squares, n is the number of observations and d is the number of regressors. Assuming that the null hypothesis holds, F_d follows the F distribution with d − 1 and n − d degrees of freedom. That is, the null hypothesis is to be rejected at the significance level α when F_d > F_{1−α,d−1,n−d}.

4.6.3 The coefficient of determination, R²

A measure of the variability in the data explained by the model is the coefficient of determination. It is also known as R² and is defined as[19]

R² = SS_R / S_yy = 1 − SS_E / S_yy    (4.24)

Because there is never more variability in any (sane) model than in the data, 0 ≤ R² ≤ 1. The higher the value of R² is, the better the model fits the data.

4.6.4 Adjusted R²

Adding variables to a model always results in an increase of the coefficient of determination, and a model with n − 1 free variables will fit every set of samples perfectly⁶. A modification[19] which accounts for the additional degrees of freedom can be defined as

R²_adj = 1 − ((n − 1) / (n − p)) (1 − R²)    (4.25)

where n is the number of samples and p is the number of estimated variables. The R²_adj, unlike R², only increases when the expanded model fits the data better than would be expected by chance alone. This statistic can be used to estimate the degree of the polynomial which fits the data best.

6. Also known as kitchen sink regression.

The R²_adj provides a very useful hint. For most of the series examined, its value reaches the maximum for quadratic polynomials. Applying a logarithmic transformation linearizes the regression line, as can be seen in figure 4.4. Treating the models as multiplicative therefore yields better results.

Figure 4.4: 'the' from Google Ngrams. (a) the original series, (b) log-transformed series.

Even though the R²_adj might indicate otherwise, the higher-order models do not describe the word usage time series very well, as can be seen in figures 4.5 and 4.6. In the first one, the model with the highest R²_adj is cubic, even though a linear model would be just as good. The second series contains a discontinuity. The fitted polynomial of degree 4 hardly provides any useful information about the data. The local maxima do not correspond with any points of interest and
the direction of growth coincides with the underlying series only on sparse, isolated intervals.

Figure 4.5: 'Smith' from Google ngrams. (a) adjusted R², (b) model with maximal R²_adj.

Figure 4.6: 'Chernobyl' from Google ngrams. (a) adjusted R², (b) model with maximal R²_adj.

Even though high-order models are usually not very salient themselves, they can be used to indicate series which are not random but do not exhibit a linear trend either. Fitting a linear model can still be used to decide the approximate slope of the trend.
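The following sketch computes R² and the adjusted R² of equation (4.25) for polynomial fits of increasing degree. It illustrates the degree-selection idea discussed above and is not the thesis's exact procedure; the example series is synthetic.

```python
import numpy as np

def adjusted_r_squared(x, y, degree):
    """R^2 and adjusted R^2 of an ordinary polynomial least-squares fit."""
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    ss_e = np.sum((y - y_hat) ** 2)
    s_yy = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_e / s_yy
    p = degree + 1                      # number of estimated parameters
    n = y.size
    r2_adj = 1.0 - (n - 1) / (n - p) * (1.0 - r2)
    return r2, r2_adj

x = np.arange(1980, 2009, dtype=float)
y = np.exp(0.05 * (x - 1980)) + np.random.default_rng(3).normal(scale=0.2, size=x.size)
for d in range(1, 5):
    print(d, adjusted_r_squared(x, y, d))
```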
5 Robust regression

The standard statistical methods are based on assumptions that cannot always be met in practice, namely that the error terms are normally distributed with known variances and mean zero. The methods presented in the following sections do away with these requirements. Robust methods are also less sensitive to the presence of outliers. For example, the mean is not a robust estimator, while the median is. Even though the idea of using the ranks of the data points in place of their values is not a new one, only the computer revolution has made their usage feasible, due to the laboriousness of sorting.
5.1 Theil-Sen estimator

The Theil-Sen estimator[23] is a statistic used to estimate the slope of the regression line. Compared to the ordinary least squares method, it does not rely on the regression function following a prespecified form, and the errors are not assumed to have a known distribution. It is model-free and non-parametric. The resulting estimate is a linear approximation of the trend.

A useful property of the estimator is its high breakdown point of 29 %. That is, up to 29 % of the samples can be arbitrarily changed without having any influence on the estimated slope. For comparison, the breakdown point of an ordinary linear least squares estimator is 0 %: every change influences the resulting estimate.

The Theil-Sen estimator is defined as the median of the pairwise slopes of the samples[22]:

β̂_ts = med{ (y_i − y_j) / (x_i − x_j) : i ≠ j }    (5.1)

An extension of this method which handles ties¹ in x has been designed by Sen[23]. In the case of time series there are no such ties, so this adjustment is unnecessary.

1. Some of the variables have the same values.
Figure 5.1: Behavior of the Theil-Sen estimator for words encountered in the British National Corpus. (a) 'spice', (b) 'snow'.
To test the significance of the slope estimated using this method, the Mann-Kendall test described in section 5.3 is usually used. To construct a confidence interval, several methods are given by Wilcox[29], but for it to be accurate, either a large number of samples is necessary, or the errors have to be normally distributed; the data obtained from natural language corpora only rarely meet these criteria.

A generalization which attaches weights to the pairwise slopes and uses a weighted median has been developed, which results in a small improvement in terms of the standard error. However, in some cases the bias of this modified estimator is considerably worse[30].

The time complexity of the naïve algorithm, which calculates all of the slopes and then takes the median, is in O(n²). Faster algorithms with time complexity in O(n log n) have been developed[17], but this does not provide a significant improvement for the problem sizes encountered.

As shown in figure 5.1a, outliers can easily confuse the ordinary least squares estimator, represented by the dashed line, while the Theil-Sen estimator ignores them and fits the whole series better. For other series, such as the one in figure 5.1b, the behavior of both estimators is comparable.
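A naïve O(n²) Python sketch of the estimator in equation (5.1); the intercept via medians is one common convention, and the function is not necessarily identical to the thesis's theilsen.py:

```python
import numpy as np

def theil_sen(x, y):
    """Median of pairwise slopes (5.1); returns (intercept, slope)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(x)) for j in range(i + 1, len(x))]
    slope = np.median(slopes)
    intercept = np.median(y) - slope * np.median(x)   # one common convention
    return intercept, slope

x = np.arange(1965, 1995, dtype=float)
y = 0.1 * (x - 1965) + np.random.default_rng(4).normal(scale=0.4, size=x.size)
y[5] += 10.0   # a single outlier barely affects the estimate
print(theil_sen(x, y))
```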
5.2 Moore-Wallis Test

Also known as the difference-sign test[8], the method was first described by Moore and Wallis[20] in 1941. The test is distribution-free; that is, the usual assumption of normality necessary for the standard statistical methods does not need to hold, and the calculation is very fast and simple, as only the differences of adjacent values need to be examined. The following derivation follows [25].

Given a sequence of n values y_1, ..., y_n², define

D_i = 1 if y_{i+1} > y_i, and D_i = 0 otherwise    (5.2)

The number of positive first differences can then be written as

D = Σ_{i=1}^{n−1} D_i    (5.3)

2. The canonical definition does not account for the case when y_i = y_{i+1}. Letting D_i = 1/2 in that case preserves the properties of this test when the alternative hypothesis asserts the existence of a trend.
and the expected value of D and the variance V[D] are

E[D] = Σ_{i=1}^{n−1} E[D_i]    (5.4)

V[D] = Σ_{i=1}^{n−1} V[D_i] + 2 Σ_{i<j} Cov(D_i, D_j)    (5.5)

From the null hypothesis that the examined series does not exhibit any serial dependence, it follows that every ordering of the values is equally probable, and therefore the same holds for the signs of the differences, so the expected value of D_i is 1/2. Additionally, the only covariances different from zero are those which share the middle value from the underlying series. The possible orderings of y_{i−1}, y_i and y_{i+1} are
y_{i−1}   y_i   y_{i+1}   D_i   D_{i+1}
1         2     3         1     1
1         3     2         1     0
2         1     3         0     1
2         3     1         1     0
3         1     2         0     1
3         2     1         0     0
Under the null hypothesis each of these possibilities is equally likely to appear, so the expected value of D_i D_{i+1} is 1/6, and therefore

Cov(D_i, D_{i+1}) = E[D_i D_{i+1}] − E[D_i] E[D_{i+1}] = 1/6 − 1/4 = −1/12    (5.6)

and Cov(D_i, D_j) = 0 in every other case. There are n − 2 pairs for which the covariance is non-zero, and

V[D_i] = E[D_i²] − (E[D_i])² = 1/2 − (1/2)² = 1/4    (5.7)

Substituting in equations (5.4) and (5.5) then yields

E[D] = (n − 1)/2    (5.8)

V[D] = (n − 1)(1/4) − (n − 2)(1/6) = (n + 1)/12    (5.9)

That is, if the null hypothesis holds, it can be expected that the number of positive first differences is E[D], with standard deviation √(V[D]). A standardized statistic can be derived:

U(D) = (D − E[D]) / √(V[D])    (5.10)

which is normally distributed with mean zero and standard deviation 1[8], so the null hypothesis is to be rejected when |U(D)| ≥ u_{1−α/2} at the significance level α.

Although the power of this test asymptotically approaches unity as the number of samples increases[25], for small sample sizes noise has a considerable effect on the outcome. The reason for this is that only local features are inspected and the actual magnitudes of the differences are not taken into account. For example, carrying the test out on the series shown in figure 5.2a results in the decision that no trend is present. On the other hand, a downward trend is identified in the series in figure 5.2b, significant at p = 0.01.

Figure 5.2: Series incorrectly classified by the Moore-Wallis test. (a) Synthetic example 1, (b) Synthetic example 2.
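A sketch of the difference-sign test as derived above, with ties handled as D_i = 1/2 per footnote 2; SciPy is assumed for the normal CDF, and the function is not necessarily identical to the thesis's diffsign.py:

```python
import numpy as np
from scipy.stats import norm

def moore_wallis(y):
    """Difference-sign test: returns the standardized statistic U(D) and its p-value."""
    y = np.asarray(y, dtype=float)
    n = y.size
    diffs = np.diff(y)
    d = np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)   # positive first differences
    e_d = (n - 1) / 2.0                                 # equation (5.8)
    v_d = (n + 1) / 12.0                                # equation (5.9)
    u = (d - e_d) / np.sqrt(v_d)                        # equation (5.10)
    p = 2.0 * (1.0 - norm.cdf(abs(u)))
    return u, p

y = np.arange(16) + np.random.default_rng(5).normal(scale=2.0, size=16)
print(moore_wallis(y))
```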
5.3 Mann-Kendall test

Considerably more powerful than the Moore-Wallis test is the Mann-Kendall test, which correctly detects the trend in figure 5.2a with p = 1.6 × 10⁻⁵ and the trend in figure 5.2b with p = 0.0002. The Mann-Kendall test statistic is defined as[8, 21]

S = Σ_{i=1}^{n} Σ_{j=1}^{i} sgn(x_i − x_j) sgn(y_i − y_j)    (5.11)

where n is the number of samples (x_1, y_1), ..., (x_n, y_n). Intuitively, S can be described as the total number of pairs of samples that are ordered correctly. A top score of S = (n choose 2) indicates that the series is increasing everywhere, while S = −(n choose 2) means that the series is decreasing. Under the null hypothesis, and when there are no ties in the
data, S has the following properties[21]:

E[S] = 0    (5.12)

V[S] = n(n − 1)(2n + 5) / 18    (5.13)

In case there are some tied values, the variance of S is computed as follows:

V[S] = [n(n − 1)(2n + 5) − Σ_{i=1}^{n} t_i (i)(i − 1)(2i + 5)] / 18    (5.14)

where t_i is the number of tied values in the i-th group³. The standardized⁴ Z statistic is computed as

Z = (S − 1) / √(V[S])   if S > 0
Z = 0                   if S = 0
Z = (S + 1) / √(V[S])   if S < 0
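Finally, a sketch of the Mann-Kendall test for the common case of no ties, following equations (5.11)–(5.13) and the standardized Z above; it is an illustration, not necessarily identical to the thesis's mannkendall.py:

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(y):
    """Mann-Kendall trend test for a series sampled at strictly increasing times."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # S as in (5.11); with strictly increasing x, sgn(x_j - x_i) = 1 for j > i.
    s = sum(np.sign(y[j] - y[i]) for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0        # equation (5.13), no ties
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return s, z, p

y = np.arange(16) + np.random.default_rng(6).normal(scale=2.0, size=16)
print(mann_kendall(y))
```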