Social Text Data Analysis
Brendan O’Connor, Machine Learning Department, Carnegie Mellon University
Talk at the UChicago Computational Social Science Workshop seminar

November 16, 2012

What can statistical text analysis tell us about society?

• Manual content analysis: analyze ideas, concepts, opinions, etc. in text
• Long and rich history (Krippendorff 2012), but very labor-intensive
• From manual to automated content analysis:
  • Quantitative comparisons
  • Pattern recognition
  • Qualitative drilldowns

Many emerging examples of automated text content analysis [Political science, media studies, economics, psychology, sociology of science, sociolinguistics, public health, history, literature...]

Appropriating tools from natural language processing, information retrieval, data mining, and machine learning as quantitative social science methodology.

Text as measurement: concepts

U.S. convention speeches’ word frequencies, by party http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html

Text as measurement: events

“... following the election of a Labor government in Israel in the summer of 1992. As with the case of the USA-USSR interactions, this time series gives a more exact measure of the patterns of events over time. For example, while the intifada follows a lull in conflict during the summer of 1987, the event data also show a general increase in conflict beginning about 18 months earlier. This increase may have been a precursor to the larger uprising.”

Figure 2: Israel-Palestinian interactions, 1982-1992. Net cooperation over time for the two directed dyads, ISR>PAL and PAL>ISR.

Automatic event extraction from news reports (Schrodt 1993)

“As these two figures illustrate, event data can be used to summarize the overall relationship between two countries over time. The patterns shown by event data usually correspond to the narrative summaries of the interactions found in historical sources, but unlike narrative accounts, event data can be subjected to statistical analysis. As a consequence, event data are frequently used to study foreign policy outcomes and some characteristics of the international environment within which foreign policy decisions occur.”

Taxonomy of text analysis methods

Two axes: computational/statistical complexity (vertical) and data/domain assumptions (horizontal). From simplest to most complex:

• Dictionary word counting
• Word counting
• Hypothesis tests etc. for word statistics
• Document classification, regression
• Vanilla topic models
• ... with (partial) labels
• ... with lexical constraints

Data and Prior Knowledge

• Bare words: can interpret text only through context
• Naturally-labeled documents: timestamps, author attributes, conversational context ...
• Associated responses: stock prices, voting decisions, citation networks ...
• Hand-labeled documents: topical categories, genres, event types, types of author expressions ...
• Hand-built dictionaries: topic keywords, sentiment lexicons, name lists, verb lists ...

Detecting cultural phenomena in textual social media

• Opinions and Time: Brendan O’Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O’Connor, Noah Smith, Eric Xing
• Internet Censorship: David Bamman, Brendan O’Connor, Noah Smith

Opinions and Time: From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. ICWSM 2010.

Measuring public opinion through social media?

(Poll time series: “approve” vs. “disapprove,” alongside an aggregated text sentiment measure. Can text give a similar measurement?)

Method

Poll data:
• Consumer confidence, 2008-2009: Index of Consumer Sentiment (Reuters/Michigan); Gallup Daily
• 2008 presidential elections: aggregation from Pollster.com
• 2009 presidential job approval: Gallup Daily

Text data: subset of Gardenhose public tweets over 2008-2009.

Which tweets correspond to which polls? Analyzed subsets of messages that contained manually selected topic keywords:
• Consumer confidence: “economy”, “jobs”, “job”
• Elections: “obama”, “mccain”
• Job approval: “obama”

Sentiment analysis: Subjectivity Clues lexicon from OpinionFinder (Wilson et al. 2005).

sentiment_t(topic word) = MessageCount_t(pos. word AND topic word) / MessageCount_t(neg. word AND topic word)
                        = p(pos. word | topic word, t) / p(neg. word | topic word, t)
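The message-counting ratio above can be sketched in a few lines. The lexicon here is a tiny illustrative stand-in for the OpinionFinder list, and the tweets are invented:

```python
# Toy stand-ins for the OpinionFinder subjectivity lexicon (illustrative only).
POS = {"good", "great", "hope", "help"}
NEG = {"bad", "lost", "fear", "slump"}

def daily_sentiment_ratio(tweets_by_day, topic_word, smoothing=1.0):
    """For each day t, count messages containing the topic word that also
    contain any positive (resp. negative) lexicon word, and return the
    smoothed ratio  count_pos / count_neg  (the slide's sentiment score)."""
    scores = {}
    for day, tweets in tweets_by_day.items():
        pos = neg = 0
        for text in tweets:
            words = set(text.lower().split())
            if topic_word not in words:
                continue  # only keyword-matched messages count
            if words & POS:
                pos += 1
            if words & NEG:
                neg += 1
        scores[day] = (pos + smoothing) / (neg + smoothing)
    return scores

tweets = {
    "2009-01-01": ["good news about jobs", "jobs report is bad", "cats are great"],
    "2009-01-02": ["lost my day", "hope for new jobs"],
}
print(daily_sentiment_ratio(tweets, "jobs"))
```

A message counts toward both numerator and denominator if it contains both a positive and a negative word, matching the count-based definition above.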

A note on the sentiment list

• Not well-suited for social media English: e.g. “sucks”, “:)”, “:(” are missing

Top examples:
word  valence   count
will  positive  3934
bad   negative  3402
good  positive  2655
help  positive  1971

Random examples:
word         valence   count
funny        positive  114
fantastic    positive  37
cornerstone  positive  2
slump        negative  85
bearish      negative  17
crackdown    negative  5

Smoothed comparisons: “jobs” sentiment vs. consumer confidence

Sept 15, 2008: Lehman collapse. Mar 2009: stock market begins recovery.
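The smoothing referenced above can be sketched as a trailing moving average over the daily sentiment series; the window size here is an arbitrary illustrative choice:

```python
def moving_average(series, window):
    """Smooth a daily series with a trailing window average: each output
    point averages the current day and up to (window - 1) preceding days."""
    out = []
    for t in range(len(series)):
        lo = max(0, t - window + 1)
        chunk = series[lo:t + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(moving_average([1, 3, 5, 7], window=2))  # [1.0, 2.0, 4.0, 6.0]
```

Early days use a shorter effective window rather than being dropped, so the output has the same length as the input.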

Which leads, poll or text?

• Cross-correlation between the sentiment score on day t and the poll at day t+L
• sentiment(“jobs”) is a leading indicator for the poll
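The lead-lag test above can be sketched as a Pearson correlation at a chosen lag; the toy series below (invented for illustration) has the poll trailing text sentiment by one day:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def cross_correlation(text_sent, poll, lag):
    """Correlate sentiment on day t with the poll on day t + lag.
    A large value at positive lag suggests text leads the poll."""
    if lag > 0:
        return pearson(text_sent[:-lag], poll[lag:])
    return pearson(text_sent, poll)

# Toy series where the poll simply repeats yesterday's text sentiment:
sent = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
poll = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
print(round(cross_correlation(sent, poll, lag=1), 3))  # 1.0
```

Scanning `lag` over a range of values and plotting the correlations gives the lead-lag profile the slide describes.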

Keyword message selection (15-day windows, no lag):

• sentiment(“jobs”): r = +0.80
• sentiment(“job”): r = +0.07
• sentiment(“economy”): r = –0.10

Look out for stemming:
• sentiment(“jobs” OR “job”): r = +0.40

Presidential elections [doesn’t work]
• 2008 elections: sentiment(“obama”), sentiment(“mccain”) do not correlate with polls

Presidential job approval [~works]
• 2009 job approval: sentiment(“obama”) gives r = 0.72
• Looks easy: a simple decline

election.twitter.com Proprietary sentiment analyzer over Obama vs Romney name-containing tweets

http://about.topsy.com/wp-content/uploads/2012/08/Twindex-report1.pdf


Twitter and Polls

• Preliminary results that sentiment analysis on Twitter data can give information similar to opinion polls
• But still not well understood!
  • Who is using Twitter? Massive changes over time (2008 Twitter != 2012 Twitter)
  • News vs. opinion? Other data sources might better distinguish?
• Better text analysis
  • Very wide linguistic variation on Twitter
  • Word sense ambiguity: “steve jobs”
• Better data sources
• Suggestion for future work: analyze correlations to pre-existing surveys and other attitude measurements
• Not a replacement for polls, but seems potentially useful. Between ethnography and surveys?
• See also: “I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper” -- A Balanced Survey on Election Prediction using Twitter Data. Daniel Gayo-Avello, arXiv 2012.

Taxonomy of text analysis methods

This study sits at the simple end of the taxonomy: word counting and dictionary word counting, analyzed with correlations, ratios, and counts, but relying on a hand-built dictionary.

Detecting cultural phenomena in textual social media

• Opinions and Time: Brendan O’Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O’Connor, Noah Smith, Eric Xing
• Internet Censorship: David Bamman, Brendan O’Connor, Noah Smith

Language and Geography: A Latent Variable Model for Geographic Lexical Variation. EMNLP 2010.

Languages exhibit variation, reflecting geography, status, race, gender, etc.

Searching for dialect in social media

One approach: search for known variable alternations, e.g. ...

Vocabulary: 1,818 words (used at least 5 times in one week within one MSA)

Diffusion of novel words

Figure 1: Change in popularity for three words: bruh, af, -__-, over weeks 1-8, 35-43, and 78-86. Blue circles indicate cities in which the probability of an author using the word in a week in the listed period is greater than 1% (2.5% for bruh).

Clearly geography plays some role in each example, most notably in bruh, which may reference southeastern phonology. But each example also features long-range “jumps” over huge parts of the country.

Model

• MSAs as nodes in a diffusion network
• Can’t use simple cross-correlations: confounds for size of region, global events (e.g. TV shows, holidays), etc.
• Logistic-binomial probability of whether an author uses a word at a particular (region, time)
• “Influence” represented as a linear dynamical system (latent vector autoregression)
• “Influence”/“transmission” from r -> s: words popular in region r later become popular in s [a lead-lag notion of influence/transmission]

Model:

c_{i,r,t} ~ Binomial(s_{r,t}, σ(ν_i + τ_{r,t} + η_{i,*,t} + η_{i,r,t}))    (1)
η_{i,t} ~ Normal(A η_{i,t-1}, Γ)

Notation (Table 1; the index i indicates words, r indicates regions, t indicates time in weeks):
• c_{i,r,t}: count of authors who use word i in region r at time t
• s_{r,t}: count of authors who post messages in region r at time t
• ν_i: overall log-frequency of word i
• τ_{r,t}: general activation of region r at time t
• η_{i,*,t}: global activation of word i at time t
• η_{i,r,t}: activation of word i in region r at time t
• η_{i,t}: vertical concatenation of each η_{i,r,t} and η_{i,*,t}, a vector of size R+1
• θ_{i,r,t}: estimated probability of using word i in region r at time t
• σ(x) = e^x / (1 + e^x), the logistic function
• A: autoregressive coefficients; the region-to-region coefficients govern lexical diffusion for all words
• Γ: variance of the autoregressive process
• ζ_{i,r,t}, Σ_{i,r,t}: mean and variance of the Gaussian (Taylor) approximation to the logistic-binomial likelihood, used as pseudo-emissions in the Kalman smoother
• ω^{(k)}_{i,r,t}: weight of particle k in the forward pass of the sequential Monte Carlo algorithm

Model fitting with EM:
• Fitting η: Gaussian approximation to the logistic-binomial emission in the Kalman smoother
• Kalman smoother/EM to learn Γ and diag(A); non-diagonals of A from VAR fits to sampled η trajectories
• About 30 million variables; takes a while

Our goal is to estimate confidence intervals around the cross-regional autoregression coefficients A, which are computed as a function of the regional-temporal word activations η_{i,r,t}. We take a Monte Carlo approach, computing samples for the trajectories η_{i,r}, and then computing point estimates of A for each sample, aggregating over all words i. Bayesian confidence intervals can then be computed from these point estimates, regardless of the form of the estimator used to compute A.
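Equation (1) can be illustrated with a minimal forward simulation of the generative model for a single word; the sizes (R = 3 regions, T = 50 weeks) and all parameter values are made up, and τ is held at zero for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes and parameters for one word i (illustrative only).
R, T = 3, 50
nu = -4.0                                   # overall log-frequency nu_i
tau = np.zeros((R, T))                      # regional activation, held at 0 here
A = 0.8 * np.eye(R + 1)                     # autoregressive coefficients (diagonal only)
Gamma = 0.05 * np.eye(R + 1)                # innovation covariance
s = rng.integers(500, 2000, size=(R, T))    # authors posting per (region, week)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# eta[:, t] stacks the R regional activations plus one global activation (last row).
eta = np.zeros((R + 1, T))
c = np.zeros((R, T), dtype=int)
for t in range(T):
    if t > 0:
        eta[:, t] = A @ eta[:, t - 1] + rng.multivariate_normal(np.zeros(R + 1), Gamma)
    # Binomial emission: number of authors using the word in each region at week t.
    p = sigmoid(nu + tau[:, t] + eta[R, t] + eta[:R, t])
    c[:, t] = rng.binomial(s[:, t], p)

print(c.shape)  # (3, 50)
```

Inference in the paper runs the other way, recovering η from observed counts c and s via the Kalman-smoother approximation; this sketch only shows the data-generating direction.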

Smoothing examples

Figure 2: Left: Monte Carlo and Kalman smoothed estimates of η for the word ctfu in five cities (Cleveland OH, Philadelphia PA, Pittsburgh PA, Erie PA, ...) over roughly 80 weeks. Right: estimates of term frequency in each city. Circles show empirical frequencies.

4.2 Estimation of system dynamics parameters

The parameter Γ controls the variance of the latent state, and the diagonal elements of A control the amount of autocorrelation within each region. The maximum-likelihood settings of these parameters are closely tied to the latent state η. For this reason, we estimate these parameters using an expectation-maximization algorithm, iterating between maximum-likelihood estimation of Γ and diag(A) and computation of a variational bound on P(η_i | c_i, s; Γ, A). We run the sequential Monte Carlo procedure described above only after converging to a local optimum for Γ and diag(A).

Estimating cross-region effects

The maximum-likelihood estimate for the system coefficients is ordinary least squares:

A = (η_{1:T-1}^T η_{1:T-1})^{-1} η_{2:T}^T η_{1:T-1}

The regression includes a bias term. We experimented with ridge regression for regularization (penalizing the squared Frobenius norm of A), comparing regularization coefficients by cross-validation within the time series. We found that regularization can significantly improve the likelihood on held-out data when computing A for individual words, but when words are aggregated, regularization yields no improvement.

Transmission edge selection

• Simulated posterior samples of the A matrix
• Which of the 39,600 non-diagonals (edges) are significantly non-zero? Multiple hypothesis testing.
• Bonferroni correction is too conservative: it controls the familywise error rate, P(≥ 1 false positive)
• Benjamini-Hochberg: control the False Discovery Rate, E[% false positives]

We compute confidence estimates for A: an estimate A^{(k)} for each sampled sequence η^{(k)}, aggregating over all words. We then compute Monte Carlo-based confidence intervals around each A_{m,n} by fitting Gaussians to the samples: μ_{m,n} = (1/K) Σ_k A^{(k)}_{m,n} and σ²_{m,n} = (1/K) Σ_k (A^{(k)}_{m,n} - μ_{m,n})². To identify coefficients that are very likely to be non-zero, we compute Wald tests using z-scores z_{m,n} = μ_{m,n}/σ_{m,n}, applying the Benjamini-Hochberg False Discovery Rate correction for multiple hypothesis tests and computing a one-tailed z-score threshold z̄ such that the expected proportion of false positives is controlled. Coefficients that pass this threshold, 3,544 edges, can be confidently considered to indicate cross-regional influence in the autoregressive model.

Transmission network

Figure 4: Lexical influence network: high-confidence, high-influence links (z > 3.12, μ > 0.025). Left: among all 50 largest MSAs. Right: subnetwork for the Eastern seaboard area. See also Figure 5, which uses the same colors for cities.

Transmission ego networks

Figure 5: Ego networks for each of the 50 largest MSAs (New York NY, Los Angeles CA, Chicago IL, Dallas TX, Houston TX, Miami FL, Washington DC, Atlanta GA, Boston MA-NH, and others). Unlike Figure 4, all incoming and outgoing links are included (having z-score > 3.12). A blue link between cities indicates there is both an incoming and an outgoing edge; green indicates outgoing-only; orange indicates incoming-only. Maps are ordered by population size. Official MSA names include up to three city names to describe the area; we truncate to the first.

How does distance matter?

(Ego networks for the four Texas MSAs: Houston, Dallas, San Antonio, Austin.)

Symmetric properties of transmission

By analyzing the properties of pairs of cities with significant influence, we can identify the geographic and demographic drivers of linguistic influence. First, we consider the ways in which similarity and proximity between regions cause them to share linguistic influence. Then we consider the asymmetric properties that cause some regions to be linguistically influential. The role of geographical proximity is apparent: there are dense connections within regions such as the northeast, midwest, and west coast, and relatively few cross-country connections.

5.1 Symmetric properties of linguistic influence

To assess symmetric properties of linguistic influence, we compare pairs of MSAs linked by non-zero autoregressive coefficients versus randomly-selected pairs of MSAs. Specifically, we compute the empirical distribution over senders and receivers (how often each city fills each role), and we sample pairs from these distributions. The baseline thus includes the same distribution of MSAs as the experimental condition, but randomizes the associations. Even if our model were predisposed to detect influence among certain types of MSAs (for example, large or dense cities), that would not bias this analysis, since the aggregate makeup of the linked and non-linked pairs is identical.

Table 2: Geographical distances and absolute demographic differences for linked and non-linked pairs of MSAs. Confidence intervals are p < .01, two-tailed.

           geo distance  % White     % Af. Am.    % Hispanic   % urban      % renter     log income
linked     10.5 ± 0.5    10.2 ± 0.4  8.45 ± 0.37  9.88 ± 0.64  9.55 ± 0.34  6.30 ± 0.24  0.181 ± 0.006
unlinked   20.8 ± 0.6    16.2 ± 0.5  16.3 ± 0.5   15.2 ± 0.8   12.0 ± 0.4   6.78 ± 0.25  0.201 ± 0.007

We fit a logistic regression on the product of populations, geographical proximity, and the absolute difference of each demographic feature; all features are standardized. The resulting coefficients and confidence intervals are shown in Table 3(a). Product of populations, geographical proximity, and similar proportions of African Americans are the most clearly important predictors. Even after accounting for geography and population size, language change is significantly more likely to be propagated between regions that are demographically similar, particularly with respect to race and ethnicity.

Table 3: Logistic regression to predict linked pairs of MSAs.

(a) Coefficients predicting influence links (bold in the original indicates statistical significance at p < .01):

feature                 estimate  s.e.    t-value
intercept               -0.0601   0.0287  -2.10
product of populations   0.393    0.048    8.22
distance                -0.870    0.033   -26.1
abs. diff. % White      -0.214    0.040   -5.39
abs. diff. % Af. Am.    -0.693    0.042   -16.7
abs. diff. % Hispanic   -0.140    0.030   -4.63
abs. diff. % urban      -0.170    0.030   -5.76
abs. diff. % renters    -0.0314   0.0304  -1.04
abs. diff. log income    0.0458   0.0301   1.52

(b) Accuracy of predicting influence links, with ablated feature sets:

feature set    accuracy  gap
all features   72.3
-population    71.6      0.7
-geography     67.6      4.7
-demographics  66.9      5.4

Finally, we consider the impact of removing features on the accuracy of classifying whether a pair of MSAs is linguistically linked by our model. Table 3(b) shows the classification accuracy, computed over five-fold cross-validation. Removing the population feature impairs accuracy to a small extent; removing either of the other two feature sets makes accuracy noticeably worse.

5.2 Asymmetric properties of linguistic influence: senders vs. receivers

Next we evaluate asymmetric properties of linguistic influence. The goal is to identify the features that make metropolitan areas more likely to send rather than receive lexical influence. Places that possess many of these characteristics are likely to have originated, or at least popularized, many of the neologisms observed in social media text. We consider the characteristics of the 466 asymmetric pairs.

Table 4 shows the average difference in population and demographic attributes between the sender and receiver cities in each of the 466 asymmetric pairs. Senders have significantly higher populations, more African Americans, fewer Whites, more renters, greater income, and are more urban (p < 0.01 in all cases).

Table 4: Differences in demographic attributes between senders and receivers of lexical influence.

            Log pop.  % White  % Af. Am.  % Hispanic  % Urban  % Renters  Log income
difference  0.968     -0.0858  0.0703     0.0094      0.0612   0.0231     0.0546
s.e.        0.0543    0.0065   0.0063     0.0098      0.0054   0.0041     0.0113
z-score     17.8      -13.2    11.1       0.950       11.3     5.67       4.82

Because demographic attributes and population levels are correlated, we again turn to logistic regression to try to identify the unique contributing factors. In this case, the classification problem is to identify the sender in each asymmetric pair. As before, all features are standardized. The feature weights from training on the full dataset are shown in Table 5. Only the coefficients for population size and percentage of African Americans are statistically significant. In 5-fold cross-validation, this classifier achieves 82% accuracy (population alone achieves 78% accuracy; without the population feature, accuracy is 77%).

Table 5: Regression coefficients for predicting direction of influence.

          Log pop.  % White  % Af. Am.  % Hispanic  % Urban  % Renters  Log income
weight    2.22      -0.246   1.08       0.0914      -0.129   -0.0133    0.225
s.e.      0.290     0.315    0.343      0.229       0.221    0.180      0.194
t-score   7.68      -0.78    3.15       0.40        -0.557   -0.0736    1.16

Remember biases in the sample: Twitter skews toward minorities, though the exact demographic composition of our sample is not known.

Tentative conclusions

• Strong roles for both geography and demographics
• But the demographic effects are remarkable, given metropolitan area-level aggregation
  • Need neighborhood-level aggregation?
• Racial homophily most important?
  • Note: socioeconomic class is difficult to assess from this Census data
• Issues with the word-independence assumption?
  • e.g. is bruh related to bro?

Taxonomy of text analysis methods

Computational/statistical complexity, from simple to complex:
• Correlations, ratios, counts (no optimization): word counting, dictionary word counting
• Generalized linear models (convex optimization): document classification, regression; hypothesis tests etc. for word statistics
• Latent variable models (MCMC or non-convex optimization): vanilla topic models, ... with (partial) labels, ... with lexical constraints

Data/domain assumptions, from weakest to strongest: bare words; naturally-labeled documents; hand-labeled documents; hand-built dictionaries.

A third axis, linguistic sophistication, runs from bare words through phrases, names, and syntax up to opinions, frames, and discourse.

Thanks! All papers and slides available at http://brenocon.com. Feedback welcome!

• Opinions and Time: Brendan O’Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O’Connor, Noah Smith, Eric Xing