Social Text Data Analysis
Brendan O'Connor, Machine Learning Department, Carnegie Mellon University
Talk at the UChicago Computational Social Science Workshop seminar
November 16, 2012
What can statistical text analysis tell us about society?
• Manual content analysis: analyze ideas, concepts, opinions, etc. in text. Long and rich history (Krippendorff 2012), but very labor-intensive.
• From manual to automated content analysis:
  • Quantitative comparisons
  • Pattern recognition
  • Qualitative drilldowns
• Many emerging examples of automated text content analysis [political science, media studies, economics, psychology, sociology of science, sociolinguistics, public health, history, literature...]
• Appropriating tools from natural language processing, information retrieval, data mining, and machine learning as quantitative social science methodology.
Text as measurement: concepts
U.S. convention speeches’ word frequencies, by party http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
Text as measurement: events

Automatic event extraction from news reports (Schrodt 1993).

[Figure 2: Israel-Palestinian interactions, 1982-1992. Monthly net cooperation scores for ISR>PAL and PAL>ISR.]

Excerpt from the source: "... following the election of a Labor government in Israel in the summer of 1992. As with the case of the USA-USSR interactions, this time series gives a more exact measure of the patterns of events over time. For example, while the intifada follows a lull in conflict during the summer of 1987, the event data also show a general increase in conflict beginning about 18 months earlier. This increase may have been a precursor to the larger uprising. ... As these two figures illustrate, event data can be used to summarize the overall relationship between two countries over time. The patterns shown by event data usually correspond to the narrative summaries of the interactions found in historical sources, but unlike narrative accounts, event data can be subjected to statistical analysis. As a consequence, event data are frequently used to study foreign policy outcomes and some characteristics of the international environment within which foreign policy decisions occur."
Taxonomy of text analysis methods
(vertical axis: computational / statistical complexity; horizontal axis: data/domain assumptions)
• Vanilla topic models
  • ... with (partial) labels
  • ... with lexical constraints
• Document classification, regression
• Hypothesis tests etc. for word statistics
• Word counting
• Dictionary word counting
Data and Prior Knowledge
(data/domain assumptions, from fewest to most)
• Bare words: can interpret text only through context.
• Naturally-labeled documents: timestamps, author attributes, conversational context ...; associated responses: stock prices, voting decisions, citation networks ...
• Hand-labeled documents: topical categories, genres, event types, types of author expressions ...
• Hand-built dictionaries: topic keywords, sentiment lexicons, name lists, verb lists ...
Detecting cultural phenomena in textual social media
• Opinions and Time: Brendan O'Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O'Connor, Noah Smith, Eric Xing
• Internet Censorship: David Bamman, Brendan O'Connor, Noah Smith

Opinions and Time: From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. ICWSM 2010.
Measuring public opinion through social media?
[Figure: public opinion poll time series with "approve" and "disapprove" trend lines, shown alongside a stream of tweets. Can aggregated text sentiment give a similar measurement?]
Method

Poll data:
• Consumer confidence, 2008-2009: Index of Consumer Sentiment (Reuters/Michigan); Gallup Daily
• 2008 Presidential Elections: aggregation from Pollster.com
• 2009 Presidential Job Approval: Gallup Daily

Text data: subset of Gardenhose public tweets over 2008-2009.

Which tweets correspond to which polls? Analyzed subsets of messages that contained a manually selected topic keyword:
• Consumer confidence: "economy", "jobs", "job"
• Elections: "obama", "mccain"
• Job approval: "obama"

Sentiment analysis: Subjectivity Clues lexicon from OpinionFinder (Wilson et al. 2005).

sentiment_t(topic word) = MessageCount_t(pos. word AND topic word) / MessageCount_t(neg. word AND topic word)
                        = p(pos. word | topic word, t) / p(neg. word | topic word, t)
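Below is a minimal sketch of this day-by-day keyword-and-lexicon counting, assuming hypothetical in-memory (date, text) tweet records and tiny stand-in word lists rather than the actual OpinionFinder lexicon:

```python
from collections import defaultdict

# Tiny stand-in lexicons; the study used the OpinionFinder Subjectivity Clues lexicon.
POS_WORDS = {"good", "great", "hope", "happy"}
NEG_WORDS = {"bad", "sad", "fail", "worried"}

def daily_sentiment_ratio(tweets, topic_word, smoothing=1.0):
    """tweets: iterable of (date, text) pairs.
    Returns {date: positive/negative message-count ratio} over messages
    containing the topic keyword."""
    pos = defaultdict(float)
    neg = defaultdict(float)
    for day, text in tweets:
        tokens = set(text.lower().split())
        if topic_word not in tokens:
            continue
        if tokens & POS_WORDS:
            pos[day] += 1
        if tokens & NEG_WORDS:
            neg[day] += 1
    days = sorted(set(pos) | set(neg))
    # Additive smoothing keeps the ratio defined on days with few matching messages.
    return {d: (pos[d] + smoothing) / (neg[d] + smoothing) for d in days}
```

A message containing both a positive and a negative word counts toward both tallies, matching the message-count definition above.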
A note on the sentiment list
• Not well-suited for social media English: e.g. "sucks", ":)", ":(".

Top examples:
  word   valence    count
  will   positive   3934
  bad    negative   3402
  good   positive   2655
  help   positive   1971

Random examples:
  word          valence    count
  funny         positive   114
  fantastic     positive   37
  cornerstone   positive   2
  slump         negative   85
  bearish       negative   17
  crackdown     negative   5
Smoothed comparisons: "jobs" sentiment vs. consumer confidence
[Sequence of plots comparing the smoothed daily "jobs" sentiment ratio against consumer confidence polls. Annotations: Sept 15, 2008: Lehman collapse. Mar 2009: stock market begins recovery.]
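As a rough illustration, the kind of trailing moving average used to compare the noisy daily sentiment series with slower-moving poll numbers might look like the sketch below (the window length is an illustrative parameter, not a value taken from these plots):

```python
def moving_average(series, window=15):
    """Trailing moving average over a list of daily values; early days
    average over however much history is available."""
    smoothed = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Example: smooth a short daily sentiment-ratio series.
print(moving_average([1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0], window=3))
```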
Which leads, poll or text?
• Cross-correlation between:
  • sentiment score on day t
  • poll on day t+L
• sentiment("jobs") is a leading indicator for the poll.
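A minimal sketch of this lead-lag cross-correlation, assuming two aligned equal-length daily series (placeholder data; the talk's exact alignment and lag range are not shown here):

```python
import numpy as np

def lagged_correlation(sentiment, poll, lag):
    """Pearson correlation between sentiment on day t and the poll on day t+lag.
    A positive lag means the text series leads the poll."""
    if lag > 0:
        x, y = sentiment[:-lag], poll[lag:]
    elif lag < 0:
        x, y = sentiment[-lag:], poll[:lag]
    else:
        x, y = sentiment, poll
    return float(np.corrcoef(x, y)[0, 1])

# Scan a range of leads and lags to see which series leads the other.
sentiment = np.random.rand(200)   # placeholder daily text-sentiment series
poll = np.random.rand(200)        # placeholder daily poll series
for lag in range(-30, 31, 10):
    print(lag, round(lagged_correlation(sentiment, poll, lag), 3))
```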
Keyword message selection (15-day windows, no lag):
• sentiment("jobs"): r = +0.80
• sentiment("job"): r = +0.07
• sentiment("economy"): r = -0.10
• Look out for stemming: sentiment("jobs" OR "job"): r = +0.40
Presidential elections [doesn't work]; presidential job approval [~works]
• 2008 elections: sentiment("obama") and sentiment("mccain") do not correlate with the polls.
• 2009 job approval: sentiment("obama") gives r = 0.72. Looks easy: a simple decline.
election.twitter.com: proprietary sentiment analyzer over Obama vs. Romney name-containing tweets.
http://about.topsy.com/wp-content/uploads/2012/08/Twindex-report1.pdf
Twitter and Polls
• Preliminary results that sentiment analysis on Twitter data can give information similar to opinion polls.
• But it is still not well understood!
  • Who is using Twitter? Massive changes over time (2008 Twitter != 2012 Twitter).
  • News vs. opinion? Other data sources might distinguish these better.
  • Better text analysis: very wide linguistic variation on Twitter; word sense ambiguity (e.g. "steve jobs").
  • Better data sources.
• Suggestion for future work: analyze correlations to pre-existing surveys and other attitude measurements.
• Not a replacement for polls, but seems potentially useful. Between ethnography and surveys?
• See also: "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data. Daniel Gayo-Avello, arXiv 2012.
Taxonomy of text analysis methods (revisited)
(vertical axis: computational / statistical complexity; horizontal axis: data/domain assumptions)
• Word counting and dictionary word counting, analyzed with correlations, ratios, and counts: this is where the preceding Twitter/polls analysis sits.
Detecting cultural phenomena in textual social media
• Opinions and Time: Brendan O'Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O'Connor, Noah Smith, Eric Xing. A Latent Variable Model for Geographic Lexical Variation. EMNLP 2010.
• Internet Censorship: David Bamman, Brendan O'Connor, Noah Smith
• Languages exhibit variation, reflecting geography, status, race, gender, etc.
• Word list: 1,818 words, each used at least 5 times in one week within one MSA.
Diffusion of novel words
[Maps of word popularity by metropolitan area in three periods (weeks 1-8, weeks 35-43, weeks 78-86) for the words bruh, af, and -__-.]

Figure 1: Change in popularity for three words: bruh, af, -__-. Blue circles indicate cities in which the probability of an author using the word in a week in the listed period is greater than 1% (2.5% for bruh).

Clearly geography plays some role in each example — most notably in bruh, which may reference southeastern phonology. But each example also features long-range "jumps" over huge parts of the ...
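A minimal sketch of the per-city, per-week popularity measure behind Figure 1, assuming a hypothetical flat table of (week, city, author, used_word) observations:

```python
from collections import defaultdict

def weekly_usage_probability(records):
    """records: iterable of (week, city, author_id, used_word) tuples.
    Returns {(week, city): fraction of active authors who used the word}."""
    active = defaultdict(set)   # all authors posting in a (week, city)
    users = defaultdict(set)    # authors who used the word in a (week, city)
    for week, city, author, used in records:
        active[(week, city)].add(author)
        if used:
            users[(week, city)].add(author)
    return {key: len(users[key]) / len(active[key]) for key in active}

def popular_cells(probs, threshold=0.01):
    """(week, city) cells above the plotting threshold (1% in Figure 1)."""
    return [key for key, p in probs.items() if p > threshold]
```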
Model
• MSAs as nodes in a diffusion network.
• Can't use simple cross-correlations: confounds for size of region, global events (e.g. TV shows, holidays), etc.
• Logistic-binomial probability of whether an author uses a word at a particular (region, time).
• "Influence" represented as a linear dynamical system (latent vector autoregression).
• "Influence"/"transmission" from r -> s: words popular in region r later become popular in s (a lead-lag notion of influence/transmission).
Model:

c_{i,r,t} ~ Binomial(s_{r,t}, σ(ν_i + τ_{r,t} + η_{i,*,t} + η_{i,r,t}))    (1)
η_{i,t} ~ Normal(A η_{i,t-1}, Γ),  so  E[η_{i,t,r}] = Σ_s A_{r,s} η_{i,t-1,s}

The sum η_{i,*,t} + η_{i,r,t} can be rewritten as a vector product h_r η_{i,t}, where h_r is a row indicator vector that picks out the elements η_{i,r,t} and η_{i,*,t}.

Notation (Table 1; the index i indicates words, r indicates regions, t indicates time in weeks):
• c_{i,r,t}: count of authors who use word i in region r at time t
• s_{r,t}: count of authors who post messages in region r at time t
• ν_i: overall log-frequency (log-odds) of word i
• τ_{r,t}: general activation of region r at time t
• η_{i,*,t}: global activation of word i at time t
• η_{i,r,t}: activation of word i in region r at time t
• η_{i,t}: vertical concatenation of each η_{i,r,t} and η_{i,*,t}, a vector of size R+1
• θ_{i,r,t}: estimated probability of using word i in region r at time t
• σ(·): the logistic function, σ(x) = e^x / (1 + e^x)
• A: autoregressive coefficients (size R × R); the region-to-region coefficients govern lexical diffusion for all words
• Γ: variance of the autoregressive process (size R × R)
• ω^(k)_{i,r,t}: weight of particle k in the forward pass of the sequential Monte Carlo algorithm
• ζ_{i,r,t}, Σ²_{i,r,t}: parameters of the Gaussian pseudo-emission in the Kalman smoother (a Taylor approximation to the logistic-binomial likelihood)

Goal: estimate confidence intervals around the cross-regional autoregression coefficients A, which are computed as a function of the regional-temporal word activations η_{i,r,t}. We take a Monte Carlo approach: compute samples of the trajectories η_{i,r}, then compute point estimates of A for each sample, aggregating over all words i. Bayesian confidence intervals can then be computed from these point estimates, regardless of the form of the estimator used to compute A.

Model fitting with EM (30 million variables, takes a while):
• Fitting η: Gaussian approximation to the logistic-binomial emission in the Kalman smoother.
• Kalman smoother to learn diag(A); non-diagonals from VAR fits to η samples.
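A minimal sketch of this generative process (a latent vector autoregression driving a logistic-binomial emission), using made-up dimensions and parameters purely for illustration, not values fit in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
R, T = 5, 50                       # regions and weeks (toy sizes)
A = 0.9 * np.eye(R + 1)            # autoregressive coefficients (diagonal-only toy choice)
Gamma = 0.05 * np.eye(R + 1)       # innovation covariance of the latent state
nu = -4.0                          # overall log-frequency of one word
tau = np.zeros((R, T))             # per-region activation, held at zero here
s = rng.integers(500, 2000, size=(R, T))   # authors posting per (region, week)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eta = np.zeros((T, R + 1))         # last component plays the role of the global activation
c = np.zeros((R, T), dtype=int)    # authors using the word per (region, week)
for t in range(1, T):
    eta[t] = A @ eta[t - 1] + rng.multivariate_normal(np.zeros(R + 1), Gamma)
    for r in range(R):
        p = sigmoid(nu + tau[r, t] + eta[t, -1] + eta[t, r])
        c[r, t] = rng.binomial(s[r, t], p)
```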
Smoothing examples
[Left panel: smoothed latent activation η_t by week; right panel: term frequency p(w) by week, with curves for Cleveland OH, Philadelphia PA, Erie PA, and Pittsburgh PA.]

Figure 2: Left: Monte Carlo and Kalman smoothed estimates of η for the word ctfu in five cities. Right: estimates of term frequency in each city. Circles show empirical frequencies.
4.2 Estimation of system dynamics parameters
The parameter Γ controls the variance of the latent state, and the diagonal elements of A control the amount of autocorrelation within each region. The maximum likelihood settings of these parameters are closely tied to the latent state η. For this reason, we estimate these parameters using an expectation-maximization algorithm, iterating between maximum-likelihood estimation of Γ and diag(A) and computation of a variational bound on P(η_i | c_i, s; Γ, A). We run the sequential Monte Carlo procedure described above only after converging to a local optimum for Γ and diag(A).
Estimating cross-region effects
The maximum-likelihood estimate for the system coefficients is ordinary least squares:
A = (η_{1:T-1}^T η_{1:T-1})^{-1} η_{2:T}^T η_{1:T-1}
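A minimal sketch of this least-squares fit from a single sampled trajectory, assuming eta is a (T x R) array of smoothed activations for one word and using numpy's least-squares solver rather than the normal-equations form written above:

```python
import numpy as np

def fit_var_coefficients(eta):
    """Ordinary least squares estimate of A in eta_t ≈ A @ eta_{t-1}.
    eta: array of shape (T, R); returns A of shape (R, R)."""
    X = eta[:-1]   # eta_{1:T-1}
    Y = eta[1:]    # eta_{2:T}
    # Solve X @ A.T ≈ Y in the least-squares sense, then transpose.
    A_transpose, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return A_transpose.T
```

In the paper's procedure, per-sample estimates like this are aggregated over words and Monte Carlo samples to obtain point estimates and confidence intervals for A.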
Transmission edge selection
• Simulated posterior samples of the A matrix.
• Which of the 39,600 non-diagonals (edges) are significantly non-zero? This is a multiple hypothesis testing problem.
• Bonferroni correction is too conservative: it controls the familywise error rate, P(at least one false positive).
• Benjamini-Hochberg: control the False Discovery Rate, E[% false positives]. This yields 3,544 edges that can be confidently considered to indicate cross-regional influence in the autoregressive model.

The regression includes a bias term. We experimented with ridge regression for regularization (penalizing the squared Frobenius norm of A), comparing regularization coefficients by cross-validation within the time series. We found that regularization can significantly improve the likelihood on held-out data when computing A for individual words, but when words are aggregated, regularization yields no improvement.

To compute confidence estimates for A, we compute A^(k) for each sampled sequence of trajectories, summing over all words. We can then compute Monte Carlo-based confidence intervals around each A_{m,n} by fitting a Gaussian to the samples, with mean μ_{m,n} = (1/K) Σ_k A^(k)_{m,n} and variance σ²_{m,n} estimated from the same samples. To identify coefficients that are very likely to be non-zero, we compute Wald test scores z_{m,n} = μ_{m,n} / σ_{m,n}, and apply the Benjamini-Hochberg False Discovery Rate correction for multiple hypothesis tests, computing a one-tailed z-score threshold z̄ such that the expected proportion of false positives is controlled.
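A minimal sketch of Benjamini-Hochberg selection over the edge z-scores, written as a generic FDR routine (the threshold and p-value convention here are illustrative, not the paper's exact implementation):

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(z_scores, alpha=0.05):
    """Return indices of edges whose one-tailed p-values pass the BH
    false-discovery-rate procedure at level alpha."""
    z = np.asarray(z_scores, dtype=float)
    p = norm.sf(z)                          # one-tailed p-value per edge
    order = np.argsort(p)
    m = len(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    if not passed.any():
        return np.array([], dtype=int)
    k = np.max(np.nonzero(passed)[0]) + 1   # largest rank still under its threshold
    return order[:k]

# Example: keep edges with large positive autoregressive coefficients.
print(benjamini_hochberg([4.1, 0.3, 2.8, -1.0, 3.5], alpha=0.01))
```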
Transmission network
[Figure 4: Lexical influence network: high-confidence, high-influence links (z > 3.12, μ > 0.025). Left: among all 50 largest MSAs. Right: subnetwork for the Eastern seaboard area. See also Figure 5, which uses the same colors for cities.]
Transmission ego networks
[Figure 5: Ego networks for each of the 50 largest MSAs, e.g. New York NY, Los Angeles CA, Chicago IL, Dallas TX, Houston TX, ... Unlike Figure 4, all incoming and outgoing links are included (having z-score > 3.12). A blue link between cities indicates there is both an incoming and an outgoing edge; green indicates outgoing-only; orange indicates incoming-only. Maps are ordered by population size. Official MSA names include up to three city names to describe the area; we truncate to the first.]
How does distance matter?
[Ego networks for four Texas MSAs: Houston, Dallas, San Antonio, and Austin.]
Symmetric properties of transmission

(Figure 4, repeated: the lexical influence network of high-confidence, high-influence links, z > 3.12, μ > 0.025.)

Table 2: Geographical distances and absolute demographic differences for linked and non-linked pairs of MSAs. Confidence intervals are p < .01, two-tailed.

              geo distance   % White      % Af. Am.    % Hispanic   % urban      % renter     log income
  linked      10.5 ± 0.5     10.2 ± 0.4   8.45 ± 0.37  9.88 ± 0.64  9.55 ± 0.34  6.30 ± 0.24  0.181 ± 0.006
  unlinked    20.8 ± 0.6     16.2 ± 0.5   16.3 ± 0.5   15.2 ± 0.8   12.0 ± 0.4   6.78 ± 0.25  0.201 ± 0.007

The role of geographical proximity is apparent: there are dense connections within regions such as the northeast, midwest, and west coast, and relatively few cross-country connections. By analyzing the properties of pairs of cities with significant influence, we can identify the geographic and demographic drivers of linguistic influence. First, we consider the ways in which similarity and proximity between regions cause them to share linguistic influence. Then we consider the asymmetric properties that cause some regions to be linguistically influential.

5.1 Symmetric properties of linguistic influence
To assess symmetric properties of linguistic influence, we compare pairs of MSAs linked by non-zero autoregressive coefficients against randomly-selected pairs of MSAs. Specifically, we compute the empirical distribution over senders and receivers (how often each city fills each role), and we sample pairs from these distributions. The baseline thus includes the same distribution of MSAs as the experimental condition, but randomizes the associations. Even if our model were predisposed to detect influence among certain types of MSAs (for example, large or dense cities), that would not bias this analysis, since the aggregate makeup of the linked and non-linked pairs is identical.

The features are the product of populations, geographical proximity, and the absolute difference of each demographic feature. All features are standardized. The resulting coefficients and confidence intervals are shown in Table 3(a). Product of populations, geographical proximity, and similar proportions of African Americans are the most clearly important predictors. Even after accounting for geography and population size, language change is significantly more likely to be propagated between regions that are demographically similar — particularly with respect to race and ethnicity.

Table 3(a): Logistic regression coefficients predicting influence links. Bold typeface indicates statistical significance at p < .01.

  feature                   estimate   s.e.     t-value
  intercept                 -0.0601    0.0287   -2.10
  product of populations     0.393     0.048     8.22
  distance                  -0.870     0.033    -26.1
  abs. diff. % White        -0.214     0.040    -5.39
  abs. diff. % Af. Am.      -0.693     0.042    -16.7
  abs. diff. % Hispanic     -0.140     0.030    -4.63
  abs. diff. % urban        -0.170     0.030    -5.76
  abs. diff. % renters      -0.0314    0.0304   -1.04
  abs. diff. log income      0.0458    0.0301    1.52

Table 3(b): Accuracy of predicting influence links, with ablated feature sets.

  feature set      accuracy   gap
  all features     72.3
  -population      71.6       0.7
  -geography       67.6       4.7
  -demographics    66.9       5.4

Finally, we consider the impact of removing features on the accuracy of classifying whether a pair of MSAs is linguistically linked by our model. Table 3(b) shows the classification accuracy, computed over five-fold cross-validation. Removing the population feature impairs accuracy to a small extent; removing either of the other two feature sets makes accuracy noticeably worse.
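A minimal sketch of the link-prediction setup (standardized pair features, logistic regression, five-fold cross-validation), using scikit-learn with a made-up feature matrix in place of the paper's MSA-pair data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical pair features: product of populations, distance, and absolute
# demographic differences; y = 1 if the MSA pair is linked, 0 otherwise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))
y = rng.integers(0, 2, size=1000)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)   # five-fold cross-validated accuracy
print(scores.mean())
```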
Senders vs. Receivers

Table 4 shows the average difference in population and demographic attributes between the sender and receiver cities in each of the 466 asymmetric pairs. Senders have significantly higher populations, more African Americans, fewer Whites, more renters, greater income, and are more urban (p < 0.01 in all cases).

Table 4: Differences in demographic attributes between senders and receivers of lexical influence. Bold typeface indicates statistical significance at p < .01.

                Log pop.   % White   % Af. Am.   % Hispanic   % Urban   % Renters   Log income
  difference     0.968     -0.0858    0.0703      0.0094       0.0612    0.0231      0.0546
  s.e.           0.0543     0.0065    0.0063      0.0098       0.0054    0.0041      0.0113
  z-score        17.8      -13.2      11.1        0.950        11.3      5.67        4.82

Because demographic attributes and population levels are correlated, we again turn to logistic regression to identify the unique contributing factors. In this case, the classification problem is to identify the sender in each asymmetric pair. As before, all features are standardized. The feature weights from training on the full dataset are shown in Table 5. Only the coefficients for population size and percentage of African Americans are statistically significant. In 5-fold cross-validation, this classifier achieves 82% accuracy (population alone achieves 78% accuracy; without the population feature, accuracy is 77%).

Table 5: Regression coefficients for predicting direction of influence. Bold typeface indicates statistical significance at p < .01.

             Log pop.   % White   % Af. Am.   % Hispanic   % Urban   % Renters   Log income
  weight      2.22      -0.246     1.08        0.0914      -0.129    -0.0133      0.225
  s.e.        0.290      0.315     0.343       0.229        0.221     0.180       0.194
  t-score     7.68      -0.78      3.15        0.40        -0.557    -0.0736      1.16
• Remember biases in the sample: Twitter skews to minorities, though the exact demographic composition of our sample is not known.

5.2 Asymmetric properties of linguistic influence
Next we evaluate asymmetric properties of linguistic influence. The goal is to identify the features that make metropolitan areas more likely to send rather than receive lexical influence. Places that possess many of these characteristics are likely to have originated, or at least popularized, many of the neologisms observed in social media text. We consider the characteristics of the 466 asymmetric pairs.
Tentative conclusions
• Strong roles for both geography and demographics.
  • But demographics are remarkable, given metropolitan area-level aggregation. Need neighborhood-level aggregation?
• Racial homophily most important?
  • Note: socioeconomic class is difficult to assess from this Census data.
• Issues with the word independence assumption?
  • e.g. is bruh related to bro?
Taxonomy of text analysis methods
(vertical axis: computational / statistical complexity; horizontal axis: data/domain assumptions: bare words, naturally-labeled documents, hand-labeled documents, hand-built dictionaries)
• Latent variable models (MCMC or non-convex optimization): vanilla topic models; ... with (partial) labels; ... with lexical constraints
• Generalized linear models (convex optimization): document classification, regression
• Correlations, ratios, counts (no optimization): hypothesis tests etc. for word statistics; word counting; dictionary word counting
Taxonomy of text analysis methods
(adding a third axis, linguistic sophistication: bare words, phrases, names, syntax, opinions, frames, discourse)
(vertical axis: computational / statistical complexity: correlations/ratios/counts with no optimization; generalized linear models with convex optimization; latent variable models with MCMC or non-convex optimization. Horizontal axis: data/domain assumptions: bare words, naturally-labeled documents, hand-labeled documents, hand-built dictionaries)
Thanks! All papers and slides available at http://brenocon.com. Feedback welcome!
• Opinions and Time: Brendan O'Connor, Ramnath Balasubramanyan, Bryan Routledge, Noah Smith
• Language and Geography: Jacob Eisenstein, Brendan O'Connor, Noah Smith, Eric Xing