Methods for detection of word usage over time

Ondřej Herman
FI MUNI

7. 12. 2013


Motivation

- natural language is not a static object
- word usage changes over time
- natural language corpora provide relevant data

[Figure: yearly occurrences of the word 'ant' in (a) the OEC, (b) the BNC, (c) Google ngrams]

- difficult to interpret

Overview

- classical least-squares regression analysis
- robust regression methods

Linear regression

[Figure: yearly occurrences of 'slight' in Google ngrams]

- simple linear model: y = a + bx + ε
- the regression line is calculated using the least-squares method, that is,
  by minimizing the value of e = Σ_{i=1}^{n} ε_i²
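
The least-squares fit above can be sketched in a few lines of numpy. The series here is invented, standing in for the per-year corpus counts; only the fitting step reflects the slide.

```python
import numpy as np

# synthetic yearly frequencies with a known upward trend (invented data,
# standing in for e.g. Google ngrams counts)
rng = np.random.default_rng(0)
years = np.arange(1980, 2001)
freqs = 5.0 + 1.2 * (years - 1980) + rng.normal(0, 1.0, years.size)

# fit y = a + b*x by least squares; np.polyfit minimizes the sum of
# squared residuals e = sum(eps_i**2)
b, a = np.polyfit(years, freqs, deg=1)
print(f"intercept a = {a:.2f}, slope b = {b:.3f}")
```

With moderate noise the recovered slope lands close to the true 1.2 per year.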

Linear regression

[Figure: yearly occurrences of 'slight' in Google ngrams]

- polynomial model
- coefficient of determination (R²)
- adjusted R²
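
The two quantities named above can be computed directly. The formulas are the standard ones (adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) for n points and p predictors); the data is a made-up polynomial series, not a corpus word.

```python
import numpy as np

def r_squared(y, y_hat):
    # fraction of variance explained by the model
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # penalizes R² for the number of model parameters p
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 0.5 * x + 0.05 * x ** 2     # noiseless quadratic

y_hat = np.polyval(np.polyfit(x, y, deg=2), x)
print(r_squared(y, y_hat))            # essentially 1 for a perfect fit
print(adjusted_r_squared(0.9, n=20, p=2))
```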

Weighted linear regression

[Figure: yearly occurrences of 'evil' in the British National Corpus, weighted with (a) W = total counts and (b) W = log(total counts)]

- linear model
- directly using the total counts as the weights skews the results
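
A sketch of the weighted fit with numpy. Note that np.polyfit multiplies each residual by its weight before squaring, so to weight squared residuals by W one passes sqrt(W). The yearly totals here are invented, not BNC figures.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1965, 1996)
# invented per-year corpus sizes and word frequencies
totals = rng.integers(1_000, 2_000_000, years.size)
freqs = 2.0 + 0.05 * (years - 1965) + rng.normal(0, 0.3, years.size)

# np.polyfit squares w * residual, so pass sqrt of the intended weight W
b_raw, a_raw = np.polyfit(years, freqs, 1, w=np.sqrt(totals))
b_log, a_log = np.polyfit(years, freqs, 1, w=np.sqrt(np.log(totals)))
print(b_raw, b_log)
```

With raw counts a handful of big years dominate the fit, which is the skew the slide warns about; the log weights stay within a factor of about two of each other.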

Weighted linear regression

[Figure: 'Chernobyl' in Google ngrams: (a) adjusted R² by model degree, (b) the model with maximal adjusted R²]

- R², the coefficient of determination, is the fraction of variance explained by the regression model
- R² increases with the degree of the regression model
- "kitchen sink" regression
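
The "kitchen sink" effect is easy to demonstrate on synthetic data (not the 'Chernobyl' series): on truly linear data with noise, plain R² never decreases as polynomial terms are piled on, while adjusted R² charges for each extra parameter.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 3.0 * x + rng.normal(0, 1.0, x.size)   # truly linear data

def r2(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

n = x.size
plain, adjusted = [], []
for p in range(1, 9):
    y_hat = np.polyval(np.polyfit(x, y, p), x)
    s = r2(y, y_hat)
    plain.append(s)
    adjusted.append(1.0 - (1.0 - s) * (n - 1) / (n - p - 1))

print(plain)     # non-decreasing: each extra term "explains" a bit more
print(adjusted)  # the penalty makes overfitted models look worse
```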

Weighted linear regression

[Figure: yearly occurrences of 'slight' in Google ngrams]

- linear model
- logarithmic transformation of frequencies
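
One motivation for the logarithmic transformation: growth that is multiplicative per year becomes a straight line in log space, so the linear machinery still applies. A sketch with an invented exponential series:

```python
import numpy as np

years = np.arange(1980, 2001)
freqs = 0.5 * 1.15 ** (years - 1980)   # invented 15%-per-year growth

# a linear fit on log-frequencies recovers the multiplicative trend
b, a = np.polyfit(years - 1980, np.log(freqs), deg=1)
print(np.exp(a), np.exp(b))   # base frequency and yearly growth factor
```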

Linear regression - significance testing

- t-test: tests the significance of a single regression coefficient
- F-test: tests the significance of the whole model

Linear regression - significance testing

[Figure: example F-test p-values: (a) 'steep' from the Oxford English Corpus, p = 4.3 × 10⁻¹⁰; (b) 'carrot' from Google ngrams, p = 0.414]

- H0: the mean predicts the behavior of the series well
- H1: the given regression model predicts the behavior well
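
For a simple linear model (one predictor) the F-test of the whole model against the mean coincides with the t-test on the slope, which scipy.stats.linregress reports as its p-value. Two synthetic series stand in for the corpus examples:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
years = np.arange(1980, 2001)

trending = 1.0 + 0.2 * (years - 1980) + rng.normal(0, 0.5, years.size)
flat = 2.0 + rng.normal(0, 0.5, years.size)

# .pvalue tests H0: slope = 0, equivalent here to the model-vs-mean F-test
p_trend = linregress(years, trending).pvalue
p_flat = linregress(years, flat).pvalue
print(p_trend, p_flat)
```

The trending series gets a tiny p-value; the flat one typically does not.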

Robust regression

- Moore-Wallis test
- Mann-Kendall test
- Spearman's ρ
- Theil-Sen method

Moore-Wallis test

- also known as the sign-difference test

[Figure: two example series; no trend is detected in the first series, a downward trend is detected in the second]

- asymptotically optimal on short series
- the power of the test is low
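
A minimal sketch of the sign-difference idea: count rises among successive differences and compare against Binomial(n − 1, 1/2). This version uses an exact binomial test rather than the usual normal approximation, and simply drops tied neighbours; both are simplifying assumptions, not the slide's exact procedure.

```python
import numpy as np
from scipy.stats import binomtest

def sign_difference_test(y):
    # under H0 (no trend) each nonzero difference is a rise with
    # probability 1/2, so the rise count is Binomial(m, 1/2)
    diffs = np.diff(np.asarray(y, dtype=float))
    diffs = diffs[diffs != 0]            # drop tied neighbours
    rises = int(np.sum(diffs > 0))
    return binomtest(rises, n=diffs.size, p=0.5).pvalue

rng = np.random.default_rng(4)
noise = rng.normal(0.0, 1.0, 16)                    # no trend
trend = np.arange(16) + rng.normal(0.0, 0.1, 16)    # clear upward trend
print(sign_difference_test(noise), sign_difference_test(trend))
```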

Theil-Sen estimator

- defined as the median of the pairwise slopes of the samples:

  b̂ = med { (y_i − y_j) / (x_i − x_j) : i ≠ j }

[Figure: behavior of the Theil-Sen estimator for the words (a) 'spice' and (b) 'snow' encountered in the British National Corpus]
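
scipy.stats.theilslopes implements this median-of-pairwise-slopes estimator. A sketch on synthetic data showing its robustness, next to the least-squares slope for contrast:

```python
import numpy as np
from scipy.stats import theilslopes

x = np.arange(20, dtype=float)
y = 1.0 + 0.5 * x
y[3] = 40.0                      # one corrupted observation

ts_slope, ts_intercept, lo, hi = theilslopes(y, x)
ls_slope = np.polyfit(x, y, 1)[0]

# the median of pairwise slopes shrugs off the outlier;
# the least-squares slope is dragged off the true 0.5
print(ts_slope, ls_slope)
```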

Mann-Kendall test

- used to test the significance of a regression model fitted using the Theil-Sen estimator

  S = Σ_{i=1}^{n} Σ_{j=1}^{i} sgn(x_i − x_j) sgn(y_i − y_j)

[Figure: words from the British National Corpus tested using the Mann-Kendall test, with the trend line fitted using the Theil-Sen estimator: (a) 'oil', p = 0.021; (b) 'disk', p = 0.009; (c) 'slow', p = 0.821]
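
A sketch of the S statistic: with strictly increasing time points the sgn(x_i − x_j) factor is just the sign of i − j, so S reduces to a sum of signs of pairwise value differences. scipy.stats.kendalltau supplies a matching significance test, since Kendall's τ is S rescaled by the number of pairs.

```python
import numpy as np
from scipy.stats import kendalltau

def mann_kendall_s(y):
    # S sums sgn(y_i - y_j) over all pairs j < i; tied pairs contribute 0
    y = np.asarray(y, dtype=float)
    n = y.size
    return int(sum(np.sign(y[i] - y[j])
                   for i in range(n) for j in range(i)))

series = [1, 2, 2, 3, 5, 6, 6, 8]          # invented yearly counts
s = mann_kendall_s(series)
tau, p = kendalltau(np.arange(len(series)), series)
print(s, tau, p)
```

Here 26 of the 28 pairs rise and 2 are ties, so S = 26 and the trend tests as significant.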

Spearman's ρ

- calculated as the correlation coefficient of a linear model obtained by using the ranks of the observations instead of the actual values
- yields almost the same results as the Mann-Kendall test
- the distribution of the test scores is more difficult to calculate
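
The rank-substitution idea can be checked directly: scipy's spearmanr agrees with the Pearson correlation computed on the ranks by hand. Synthetic data again, not a corpus series:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(5)
years = np.arange(1976, 1996)
freqs = 2.0 + 0.1 * (years - 1976) + rng.normal(0, 0.4, years.size)

rho, p = spearmanr(years, freqs)

# the same coefficient by hand: Pearson correlation of the ranks
by_hand = np.corrcoef(rankdata(years), rankdata(freqs))[0, 1]
print(rho, by_hand, p)
```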

Slope normalization

- the slope estimates are not directly comparable, so they need to be normalized:

  d = b̂ / ȳ

  where b̂ is the estimated slope and ȳ is the mean of y, the observed frequencies

On the next slide: the slopes obtained from Google ngrams for the 50 most common words from the Oxford English Corpus, ordered by the slope relative to the mean, d.
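
A sketch of the normalization, here using the Theil-Sen slope for b̂ (the slide leaves the estimator open, so that choice is an assumption): two invented words with very different absolute counts but the same relative growth receive the same d.

```python
import numpy as np
from scipy.stats import theilslopes

def normalized_slope(y, x):
    # d = slope / mean(y): change relative to the word's typical frequency
    slope = theilslopes(y, x)[0]
    return slope / np.mean(y)

x = np.arange(10, dtype=float)
common = 1000.0 + 10.0 * x    # a frequent word
rare = 10.0 + 0.1 * x         # a rare word with the same relative trend

print(normalized_slope(common, x), normalized_slope(rare, x))  # equal
```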

word   d       b̂        | word   d       b̂
which  −1.256  −36.231   | in     −0.302  −50.594
been   −0.862  −13.977   | have   −0.249  −6.764
his    −0.836  −23.698   | who    −0.209  −2.853
he     −0.804  −20.531   | are    −0.206  −8.2
It     −0.744  −9.081    | more   −0.2    −3.146
were   −0.713  −16.541   | from   −0.178  −6.051
be     −0.69   −33.798   | to     0.0     0.0
by     −0.669  −31.319   | and    0.0     0.0
there  −0.645  −7.4      | a      0.0     0.0
was    −0.601  −31.296   | for    0.0     0.0
has    −0.572  −9.966    | on     0.0     0.0
of     −0.527  −171.39   | with   0.0     0.0
had    −0.512  −12.061   | as     0.0     0.0
would  −0.5    −7.105    | an     0.0     0.0
all    −0.496  −8.423    | or     0.0     0.0
but    −0.451  −8.853    | they   0.0     0.0
one    −0.427  −7.862    | we     0.0     0.0
not    −0.422  −14.476   | their  0.0     0.0
the    −0.4    −194.423  | said   0.0     0.0
it     −0.381  −15.137   | that   0.098   7.687
will   −0.359  −4.771    | up     0.279   2.488
is     −0.346  −30.998   | I      0.488   15.796
at     −0.338  −10.113   | about  0.504   5.583
The    −0.326  −20.124   | can    0.678   10.541
this   −0.308  −8.888    | you    1.523   23.465

The slopes obtained from Google ngrams for the 50 most common words from the Oxford English Corpus.

Future work

- anomaly detection
- piecewise linear model

Conclusion

- the Mann-Kendall test together with the Theil-Sen estimator gives the best results
- the standard linear regression model gives satisfactory results most of the time
