Methods for detection of word usage over time

Ondřej Herman, FI MUNI

7. 12. 2013


Motivation

natural language is not a static object
word usage changes over time
natural language corpora provide relevant data
[Figure: yearly occurrences of the word 'ant' in (a) OEC, (b) BNC, (c) Google ngrams]
the raw yearly counts are difficult to interpret

Overview

classical least-squares regression analysis
robust regression methods




Linear regression

[Figure: 'slight' - Google ngrams]
simple linear model: $y = a + bx + \varepsilon$
regression line calculated using the least-squares method, that is, by minimizing the value of $e = \sum_{i=1}^{n} \varepsilon_i^2$
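The least-squares fit above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used for the slides; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x.

    Returns (a, b, e), where e is the minimized sum of squared residuals.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if s_xx == 0:
        raise ValueError("all x values are identical")
    b = s_xy / s_xx              # slope
    a = y_mean - b * x_mean      # intercept
    e = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, e


# Hypothetical yearly frequencies, not taken from any corpus.
years = [1980, 1985, 1990, 1995, 2000]
freqs = [30.0, 27.5, 25.0, 22.5, 20.0]
a, b, e = fit_line(years, freqs)
```

Because the sample points lie exactly on a line, the residual sum `e` is zero here; real frequency series leave a positive remainder.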

Linear regression

[Figure: time series plot, x axis 1980 to 2000]
polynomial model
coefficient of determination ($R^2$)
adjusted $R^2$

Weighted linear regression

[Figure: 'evil' - British National Corpus; (a) W = total counts, (b) W = log(total counts)]
linear model
directly using the total counts as the weights skews the results
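A small sketch of how the choice of weights changes the fitted slope. It assumes the weights multiply the squared residuals and that the logarithm is the natural one; neither detail is stated on the slide. The name `fit_weighted_line` and all numbers are hypothetical.

```python
from math import log


def fit_weighted_line(xs, ys, weights):
    """Weighted least-squares fit of y = a + b*x.

    Minimizes sum(w_i * (y_i - a - b*x_i)**2). Returns (a, b).
    """
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    s_xx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    s_xy = sum(w * (x - x_mean) * (y - y_mean)
               for w, x, y in zip(weights, xs, ys))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b


# Hypothetical data: three years, the last one from a much smaller subcorpus.
xs = [0, 1, 2]
ys = [0.0, 0.0, 3.0]
totals = [1000, 1000, 10]        # assumed total token counts per year

_, b_equal = fit_weighted_line(xs, ys, [1, 1, 1])
_, b_raw = fit_weighted_line(xs, ys, totals)
_, b_log = fit_weighted_line(xs, ys, [log(t) for t in totals])
```

With raw totals the two large years dominate and the third point is almost ignored; the logarithmic weights give a slope between the raw-weighted and the unweighted fit.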

Weighted linear regression

[Figure: 'Chernobyl' - Google ngrams; (a) adjusted $R^2$, (b) model with maximal $R^2_{\mathrm{adj}}$]
$R^2$, the coefficient of determination, is the fraction of variance explained by the regression model
$R^2$ increases with the degree of the regression model
kitchen sink regression
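To make the penalty for model degree concrete, the following sketch computes $R^2$ and the usual adjusted $R^2$, $1 - (1 - R^2)(n - 1)/(n - p - 1)$, where $p$ is the number of regressors excluding the intercept (the polynomial degree). The function names and the numbers are illustrative assumptions.

```python
def r_squared(observed, predicted):
    """Fraction of the variance of `observed` explained by `predicted`."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p regressors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("too few observations for this many regressors")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observations and model predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(observed, predicted)

# The same raw R^2 is worth less when it took a higher degree to reach it.
adj_degree_1 = adjusted_r_squared(0.90, n=20, p=1)
adj_degree_9 = adjusted_r_squared(0.90, n=20, p=9)
```

Selecting the degree with the largest adjusted value is one way to avoid the kitchen sink effect of always preferring the highest degree.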

Weighted linear regression

[Figure: time series plot, x axis 1980 to 2000]
linear model
logarithmic transformation of frequencies
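One possible reading of the logarithmic transformation is to fit the line to the logarithms of the frequencies, so that the slope describes relative rather than absolute change. The sketch below assumes the natural logarithm and strictly positive frequencies; how zero counts are handled is not stated on the slide. The name `fit_log_linear` and the data are hypothetical.

```python
from math import exp, log


def fit_log_linear(xs, freqs):
    """Least-squares fit of log(freq) = a + b*x. Returns (a, b)."""
    if any(f <= 0 for f in freqs):
        raise ValueError("frequencies must be positive to take the logarithm")
    logs = [log(f) for f in freqs]
    n = len(xs)
    x_mean = sum(xs) / n
    l_mean = sum(logs) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xl = sum((x - x_mean) * (l - l_mean) for x, l in zip(xs, logs))
    b = s_xl / s_xx
    a = l_mean - b * x_mean
    return a, b


# Hypothetical series growing by a constant factor each year.
years = [0, 1, 2, 3, 4]
freqs = [2.0 * exp(0.1 * t) for t in years]
a, b = fit_log_linear(years, freqs)
```

For this synthetic series the fit recovers the growth rate `b` of 0.1 per year and the starting level `exp(a)` of 2.0.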

Linear regression - significance testing

t-test
    tests the significance of a single regression coefficient
F-test
    tests the significance of the whole model




Linear regression - significance testing

[Figure: example F-test p-values; (a) 'steep' from the Oxford English Corpus, p = 0.414; (b) 'carrot' from Google ngrams, p = $4.3 \times 10^{-10}$]
$H_0$: the mean predicts the behavior of the series well
$H_1$: the given regression model predicts the behavior well
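The F-test compares the variance explained by the model with the variance left in the residuals. A minimal sketch for the simple linear model follows; it computes only the F statistic, with 1 and n - 2 degrees of freedom. Turning the statistic into a p-value requires the survival function of the F distribution, which the Python standard library does not provide, so that step is left out. The name `f_statistic` and the data are hypothetical.

```python
def f_statistic(xs, ys):
    """F statistic of the simple linear model y = a + b*x against the mean.

    Returns (F, df_model, df_residual).
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    df_model, df_residual = 1, n - 2
    f = ((ss_tot - ss_res) / df_model) / (ss_res / df_residual)
    return f, df_model, df_residual


# Hypothetical series with a visible upward trend.
xs = [0, 1, 2, 3, 4]
ys = [0.0, 1.0, 1.0, 3.0, 4.0]
f, df1, df2 = f_statistic(xs, ys)
```

A large F value speaks against $H_0$; a series that fluctuates around its mean gives a value close to zero.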

Robust regression

Moore-Wallis test
Mann-Kendall test
Spearman's ρ
Theil-Sen method




Moore-Wallis test

also known as the sign-difference test
[Figure: two example series plotted over x = 0 to 16]
no trend is detected in the first series, a downward trend is detected in the second series
asymptotically optimal
on short series the power of the test is low
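A sketch of the sign-difference test under its usual large-sample normal approximation: for a series of n values without a trend, the number of positive first differences has mean (n - 1)/2 and variance (n + 1)/12. The function name is hypothetical, and treating ties as non-positive differences is an assumption made here for simplicity.

```python
from math import sqrt
from statistics import NormalDist


def difference_sign_test(ys):
    """Sign-difference test. Returns (positives, z, two_sided_p)."""
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three observations")
    # Count the steps where the series goes up (ties count as not up).
    positives = sum(1 for prev, cur in zip(ys, ys[1:]) if cur > prev)
    mean = (n - 1) / 2
    variance = (n + 1) / 12
    z = (positives - mean) / sqrt(variance)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return positives, z, p


# Hypothetical series of 17 points each.
zigzag = [1 + (i % 2) for i in range(17)]     # no trend
falling = list(range(16, -1, -1))             # strictly decreasing

_, z_zigzag, p_zigzag = difference_sign_test(zigzag)
_, z_falling, p_falling = difference_sign_test(falling)
```

Only the signs of consecutive steps enter the statistic, which is one way to see why a noisy downward drift can go undetected on short series.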

Theil-Sen estimator

defined as the median of the pairwise slopes of the samples:
$$\hat{b} = \operatorname*{med}_{i \neq j} \frac{y_i - y_j}{x_i - x_j}$$
[Figure: (a) 'spice', (b) 'snow'. Behavior of the Theil-Sen estimator for words encountered in the British National Corpus]
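The definition translates almost directly into code. In this sketch the function name `theil_sen` and the data are hypothetical, and the intercept is taken as the median of $y_i - \hat{b} x_i$, a common choice that the slide does not specify.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Theil-Sen fit. Returns (intercept, slope)."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1                      # skip pairs with undefined slope
    ]
    if not slopes:
        raise ValueError("need at least two distinct x values")
    slope = median(slopes)
    # Assumed intercept rule: median of the residual offsets.
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Hypothetical series on the line y = 1 + 2x with one gross outlier.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [1, 3, 5, 7, 9, 11, 100]
intercept, slope = theil_sen(xs, ys)
```

The single outlier changes only 6 of the 21 pairwise slopes, so the median, and with it the fitted line, is unaffected.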

Mann-Kendall test

used to test the significance of a regression model fitted using the Theil-Sen estimator
$$S = \sum_{i=1}^{n} \sum_{j=1}^{i} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)$$
[Figure: (a) 'oil', p = 0.021; (b) 'disk', p = 0.009; (c) 'slow', p = 0.821. Words from the British National Corpus tested using the Mann-Kendall test with the trend line fitted using the Theil-Sen estimator]
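A sketch of the statistic $S$ together with the usual normal approximation for its significance. It assumes there are no tied values, so the variance is n(n - 1)(2n + 5)/18, and it applies the customary continuity correction; the exact procedure behind the p-values on the slide is not stated. Function names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def sgn(value):
    """Sign of a number as -1, 0 or 1."""
    return (value > 0) - (value < 0)


def mann_kendall(xs, ys):
    """Mann-Kendall trend test. Returns (S, z, two_sided_p)."""
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three (x, y) pairs of equal length")
    s = sum(
        sgn(xs[i] - xs[j]) * sgn(ys[i] - ys[j])
        for i in range(n)
        for j in range(i)                # the j == i term is always zero
    )
    variance = n * (n - 1) * (2 * n + 5) / 18   # valid without ties
    if s > 0:
        z = (s - 1) / sqrt(variance)
    elif s < 0:
        z = (s + 1) / sqrt(variance)
    else:
        z = 0.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p


# Hypothetical, strictly increasing yearly frequencies.
years = list(range(1976, 1986))
freqs = [0.5, 0.9, 1.4, 1.6, 2.1, 2.8, 3.0, 3.7, 4.1, 4.9]
s, z, p = mann_kendall(years, freqs)
```

For ten strictly increasing values every one of the 45 pairs is concordant, so `S` reaches its maximum of 45.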

Spearman’s ρ

calculated as the correlation coefficient of a linear model obtained by using the ranks of the observations instead of their actual values
yields almost the same results as the Mann-Kendall test
the distribution of the test scores is more difficult to calculate



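A sketch of the rank-based calculation: both variables are replaced by their ranks and the ordinary Pearson correlation is computed on the ranks. Giving tied values the average of their ranks is an assumption of this sketch; the function names and data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1     # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(us, vs):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(us)
    u_mean = sum(us) / n
    v_mean = sum(vs) / n
    s_uv = sum((u - u_mean) * (v - v_mean) for u, v in zip(us, vs))
    s_uu = sum((u - u_mean) ** 2 for u in us)
    s_vv = sum((v - v_mean) ** 2 for v in vs)
    if s_uu == 0 or s_vv == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_uv / sqrt(s_uu * s_vv)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical series: monotone but far from linear.
years = [1990, 1991, 1992, 1993, 1994]
freqs = [1.0, 1.1, 1.5, 4.0, 40.0]
rho_up = spearman_rho(years, freqs)
rho_down = spearman_rho(years, freqs[::-1])
```

Because only the ordering matters, the extreme last value has no more influence than any other point.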

Slope normalization

the slope estimates are not directly comparable, they need to be normalized
$$d = \frac{\hat{b}}{\bar{y}}$$
where $\hat{b}$ is the estimated slope and $\bar{y}$ is the mean of $y$, the observed frequencies.

The table that follows lists the slopes obtained from Google ngrams of the 50 most common words from the Oxford English Corpus, ordered by the slope relative to the mean, $d$.
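A tiny worked example of the normalization, with made-up numbers: two words whose frequencies differ by a factor of ten but change at the same relative rate receive the same value of $d$. The function name `normalized_slope` is hypothetical.

```python
def normalized_slope(slope, ys):
    """Slope relative to the mean of the observed frequencies, d = slope / mean(y)."""
    mean = sum(ys) / len(ys)
    if mean == 0:
        raise ValueError("mean frequency is zero, d is undefined")
    return slope / mean


# Hypothetical frequent word: rises by 10 per year around a mean of 20.
d_frequent = normalized_slope(10.0, [10.0, 20.0, 30.0])

# Hypothetical rare word: rises by 1 per year around a mean of 2.
d_rare = normalized_slope(1.0, [1.0, 2.0, 3.0])
```

The raw slopes differ by a factor of ten, yet both words come out with the same relative slope.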

word    d        $\hat{b}$     word    d        $\hat{b}$
which   −1.256   −36.231       in      −0.302   −50.594
been    −0.862   −13.977       have    −0.249   −6.764
his     −0.836   −23.698       who     −0.209   −2.853
he      −0.804   −20.531       are     −0.206   −8.2
It      −0.744   −9.081        more    −0.2     −3.146
were    −0.713   −16.541       from    −0.178   −6.051
be      −0.69    −33.798       to      0.0      0.0
by      −0.669   −31.319       and     0.0      0.0
there   −0.645   −7.4          a       0.0      0.0
was     −0.601   −31.296       for     0.0      0.0
has     −0.572   −9.966        on      0.0      0.0
of      −0.527   −171.39       with    0.0      0.0
had     −0.512   −12.061       as      0.0      0.0
would   −0.5     −7.105        an      0.0      0.0
all     −0.496   −8.423        or      0.0      0.0
but     −0.451   −8.853        they    0.0      0.0
one     −0.427   −7.862        we      0.0      0.0
not     −0.422   −14.476       their   0.0      0.0
the     −0.4     −194.423      said    0.0      0.0
it      −0.381   −15.137       that    0.098    7.687
will    −0.359   −4.771        up      0.279    2.488
is      −0.346   −30.998       I       0.488    15.796
at      −0.338   −10.113       about   0.504    5.583
The     −0.326   −20.124       can     0.678    10.541
this    −0.308   −8.888        you     1.523    23.465

The slopes obtained from Google ngrams of the 50 most common words from the Oxford English Corpus

Future work

anomaly detection
piecewise linear model




Conclusion

the Mann-Kendall test together with the Theil-Sen estimator gives the best results
the standard linear regression model gives satisfactory results most of the time



