Using Retweet Information as a Feature to Classify ...

Using Retweet Information as a Feature to Classify Messages Contents

David Burth Kurka, Alan Godoy, Fernando J. Von Zuben
4th April 2017


What and why?

What?

We investigate the use of machine learning algorithms to classify the topic of messages published in online social networks (OSNs) without using any information about their content, relying only on data about which users shared each message.

The big question: to what extent is it possible to understand and predict aspects of processes happening on a social network, virtual or not, from its users' behaviour?


Why?

Challenges in working with online social networks:
• Content is usually unstructured: mostly written text and images
• Presence of slang, abbreviations, non-verbal information (such as emoticons or emojis) and irony


Why?

Challenges in working with online social networks:
• Dependence on social context
• In microblogging (e.g., Twitter), these issues are intensified by the restricted post length:
  • Lower efficiency of traditional text mining techniques, such as topic detection and sentiment analysis


What are others doing?

Two main approaches:
• Content-centered:
  • NLP to classify messages
  • Hu et al.: emoticons as features for sentiment classification [3]
  • Reyes et al.: hashtags to classify irony [4]
• Metadata-centered:
  • Cataldi et al.: combine content with users’ ‘authority’ to classify and identify topics [2]
  • Suh et al.: use the number of connections and the number of messages of each user to help predict the popularity of messages [5]
  • Baba et al.: use users’ behavior and community detection to group similar messages [1]

What?

Our solution: use extra information, either to avoid the complexities of natural language processing or to supplement data obtained through such classical approaches
• We did this by using the set of users that shared a given piece of content as the feature vector for a supervised classifier, trained to infer the subject of that content
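One way to picture the feature vector described above is as a binary row with one column per known user. The sketch below is only an illustration under assumptions: the function names, the toy user names and the choice to ignore users never seen in training are hypothetical and do not come from the slides.

```python
from typing import Dict, Iterable, List, Set


def build_user_index(cascades: Iterable[Set[str]]) -> Dict[str, int]:
    """Map every user seen in the training cascades to a column index."""
    users = sorted(set().union(*cascades))
    return {user: col for col, user in enumerate(users)}


def to_feature_vector(retweeters: Set[str], index: Dict[str, int]) -> List[int]:
    """Binary vector: 1 if the user shared the message, 0 otherwise."""
    vec = [0] * len(index)
    for user in retweeters:
        col = index.get(user)
        if col is not None:  # assumption: users unseen in training are ignored
            vec[col] = 1
    return vec


# Hypothetical toy data: message id -> set of users who retweeted it.
cascades = {"msg1": {"ana", "bruno"}, "msg2": {"bruno", "carla"}}
index = build_user_index(cascades.values())
X = {msg: to_feature_vector(users, index) for msg, users in cascades.items()}
```

With tens of thousands of users and only a few dozen retweets per message, such vectors are extremely sparse, so a real implementation would likely store the sets or a sparse matrix rather than dense lists.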


How?


Twitter was chosen as the source of data:
• Simplicity of its basic content: the tweet, a short text message
• Amount of public content available through APIs

Information cascades:
• Phenomena of creation, replication and transformation of content by OSN users
• Twitter has a simple mechanism for creating content cascades: retweets


How?

• Restriction: Twitter imposes a low limit on requests for messages already published, making retrospective analysis of known diffusions impossible
• Alternative: the Streaming API, which extracts data in real time
• Challenge: we need to determine a priori where a diffusion process will begin
• Solution: focus the analysis on cascades started by popular users (e.g., public figures or news agencies)


How?

How to define the ground truth?
• Manual classification: too time-consuming
• Topic detection algorithms, such as LSA or LDA, applied to the full text of the news: limited by the quality achieved by the topic detection
• Solution: use as labels the division of news into thematic sections made by the newspaper's editorial staff
• Limitations:
  • Each message is assigned to only one class
  • It does not allow the use of multiple sources (each newspaper has its own taxonomy)


How?

• One central user: Folha de São Paulo (https://twitter.com/folha), the largest Brazilian newspaper
• More than 3 million followers when collection started
• All tweets by Folha de São Paulo and all retweets of this content by other users were collected
• Period: from March 19, 2014 to September 21, 2014
• 13,463 distinct and original messages posted by the source


Dataset

Filters:
1. Messages with at least 20 retweets: 4,671 distinct messages
2. Messages belonging to the six main categories (“everyday news”, “sports”, “world”, “politics”, “entertainment” and “market”): 3,185 distinct messages
3. Automated scripts (bots) were removed
4. To balance the topics, a maximum of 450 messages was set for each category
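The four filters can be sketched as a small pipeline. This is a hedged illustration, not the authors' code: the record layout (`id`, `category`, `retweeters`), the pre-computed `bots` set, counting retweets as distinct retweeters, and capping by seeded random sampling are all assumptions, since the slides do not say how bots were detected or how the 450 messages were chosen.

```python
import random
from collections import defaultdict

MAIN_CATEGORIES = {
    "everyday news", "sports", "world", "politics", "entertainment", "market",
}


def filter_dataset(messages, bots, min_retweets=20, cap=450, seed=0):
    """Apply the four filters in the order listed on the slide.

    messages: list of dicts with keys 'id', 'category', 'retweeters' (a set).
    bots: set of user ids flagged as automated accounts (hypothetical input).
    """
    by_category = defaultdict(list)
    for msg in messages:
        if len(msg["retweeters"]) < min_retweets:    # filter 1
            continue
        if msg["category"] not in MAIN_CATEGORIES:   # filter 2
            continue
        humans = msg["retweeters"] - bots            # filter 3
        by_category[msg["category"]].append({**msg, "retweeters": humans})

    rng = random.Random(seed)
    kept = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > cap:                         # filter 4
            group = rng.sample(group, cap)           # assumption: random choice
        kept.extend(group)
    return kept
```

Note that the cap only trims large categories; a category with fewer than 450 qualifying messages keeps all of them, so the classes need not end up exactly equal in size.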


Dataset

Messages:              2,444
Retweets:              111,402 (per user: 2.49 / per message: 45.58)
Most popular message:  739 retweets
Density of retweets:   0.10%
Number of users:       44,627
Connections:           686,326
Average degree:        30.76 (⟨k_in⟩ = ⟨k_out⟩ = 15.21)
Clustering:            3.82% (for a random network it is 0.07%)
Highest in-degree:     @UOL (5,793 followers)
Diameter:              16
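Several of the derived figures in this summary can be approximately recomputed from the raw counts. The formulas below are assumptions about how the values were obtained (the slides do not define them): density is taken as retweets divided by all possible message-user pairs, and average degree as twice the number of connections divided by the number of users.

```python
# Raw counts as reported on the slide.
messages = 2_444
retweets = 111_402
users = 44_627
connections = 686_326

# Assumed definitions (not stated in the slides).
retweets_per_user = retweets / users
retweets_per_message = retweets / messages
density = retweets / (messages * users)      # fraction of filled matrix cells
average_degree = 2 * connections / users     # in-degree plus out-degree

print(f"retweets per user:    {retweets_per_user:.4f}")
print(f"retweets per message: {retweets_per_message:.2f}")
print(f"density of retweets:  {density:.2%}")
print(f"average degree:       {average_degree:.2f}")
```

A density of roughly 0.1% means that about one cell in a thousand of the message-by-user matrix is non-zero, which is why set-based distances are a natural fit for this data.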


How?

Algorithms:
• Classical machine learning algorithms were used:
  • k-nearest neighbors (k-NN)
  • Logistic regression
• Why? They have few parameters to optimize and are less prone to overfitting (we have a small data set)


What did we find?

k-NN

• The class of a new sample is defined by the most common class among its k closest neighbors
• Distance metric: Jaccard distance
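A minimal sketch of k-NN with Jaccard distance over sets of retweeting users is shown below. The toy data, the default `k=3` and the tie-breaking behaviour are illustrative assumptions; the slides do not state the value of k or how ties were resolved.

```python
from collections import Counter
from typing import List, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_predict(query: Set[str],
                training: List[Tuple[Set[str], str]],
                k: int = 3) -> str:
    """Return the most common label among the k nearest training messages."""
    neighbours = sorted(
        training, key=lambda item: jaccard_distance(query, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbours)
    # Assumption: on a tied vote, the label encountered first
    # (i.e. belonging to the nearest neighbour among the tied labels) wins.
    return votes.most_common(1)[0][0]


# Hypothetical training set: (set of retweeters, section label).
training = [
    ({"ana", "bruno", "carla"}, "sports"),
    ({"ana", "bruno"}, "sports"),
    ({"dan", "eva"}, "politics"),
    ({"dan", "eva", "fabio"}, "politics"),
    ({"eva", "fabio"}, "politics"),
]
prediction = knn_predict({"ana", "carla"}, training, k=3)
```

Jaccard distance depends only on which users two messages have in common, so it needs no vector representation and ignores the very large number of users who shared neither message.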


Logistic regression

• Linear model
• L2 regularization with parameter C = 1
• Accuracy: 48.75 ± 1.74%
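As a rough illustration of an L2-regularized linear classifier over binary user vectors, the sketch below trains one logistic model per class by plain gradient descent. Several things here are assumptions rather than facts from the slides: that C is an inverse regularization strength (objective C · Σ log-loss + ½‖w‖²), that multiple classes are handled one-vs-rest, and the learning rate, epoch count and toy data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_binary(X, y, C=1.0, lr=0.1, epochs=300):
    """Minimise C * sum(log-loss) + 0.5 * ||w||^2 (bias not regularised)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                if xj:
                    grad_w[j] += err * xj
            grad_b += err
        # Gradient of the objective, rescaled by 1 / (C * n) for a stable step.
        for j in range(n_features):
            w[j] -= lr * (C * grad_w[j] + w[j]) / (C * n)
        b -= lr * grad_b / n
    return w, b


def fit_one_vs_rest(X, labels, C=1.0):
    """Train one binary model per class (assumed multi-class strategy)."""
    return {
        c: fit_binary(X, [1 if label == c else 0 for label in labels], C=C)
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose linear score is highest."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)


# Hypothetical toy data: columns are users, rows are messages.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
     [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
labels = ["sports"] * 3 + ["politics"] * 3
models = fit_one_vs_rest(X, labels, C=1.0)
```

In this formulation each user receives one weight per class, so a learned weight can be read as how strongly that user's retweet points towards a given section.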


Logistic regression

Table 1: Confusion matrix (rows: expected class; columns: observed class). Classes: (a) everyday news; (b) sports; (c) world; (d) politics; (e) entertainment; (f) market

                             Observed
Expected     (a)     (b)     (c)     (d)     (e)     (f)    Total   Recall
(a)           35       9      12      21       9       4       90    38.9%
(b)            9      62      12       8       6       2       99    62.6%
(c)           14       7      42      10       6       1       80    52.5%
(d)           11      10       5      54       1       4       85    63.5%
(e)           10      15      13       3      39       2       82    47.6%
(f)           13       8      12      11       3       6       53    11.3%
Total         92     111      96     107      64      19
Precision  38.0%   55.9%   43.8%   50.5%   60.9%   31.6%
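The precision and recall values in Table 1 can be recomputed from the counts. The sketch below assumes rows hold the expected class and columns the observed class, which is the reading under which the reported totals add up. For this particular matrix the overall accuracy comes out at about 48.7% (238 correct out of 489), in line with the reported 48.75 ± 1.74%.

```python
CLASSES = ["everyday news", "sports", "world",
           "politics", "entertainment", "market"]

# Rows: expected class; columns: observed class (counts from Table 1).
CONFUSION = [
    [35, 9, 12, 21, 9, 4],
    [9, 62, 12, 8, 6, 2],
    [14, 7, 42, 10, 6, 1],
    [11, 10, 5, 54, 1, 4],
    [10, 15, 13, 3, 39, 2],
    [13, 8, 12, 11, 3, 6],
]


def per_class_metrics(matrix):
    """Return (precision list, recall list, overall accuracy)."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n)]
    recall = [matrix[i][i] / row_totals[i] for i in range(n)]
    accuracy = sum(matrix[i][i] for i in range(n)) / sum(row_totals)
    return precision, recall, accuracy


precision, recall, accuracy = per_class_metrics(CONFUSION)
for name, p, r in zip(CLASSES, precision, recall):
    print(f"{name:15s} precision={p:6.1%} recall={r:6.1%}")
print(f"accuracy={accuracy:.1%}")
```

The "market" class stands out with the lowest recall, and it is also the class with the fewest expected messages in this matrix.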


Logistic regression

Expected: politics, found: sports
• Palácio do Planalto’s [seat of Brazilian government] computer was used to edit Temer [vice-president at that time] and Ideli [senator] pages. http://t.co/wuqblBe2lD
• Marco Feliciano [congressman] says he sees contradiction in criticism from PT [political party] to Marina Silva [candidate to presidency]. http://t.co/IZOEGLnAZL
• #FolhaInTheWorldCup protesters against the World Cup invade federal government seminar in Recife. http://t.co/iP3CLCC7Nn
• #FolhaElections Eduardo Jorge [candidate to presidency] speaks live to ’TV Folha’ at 4pm. http://t.co/GHB3VbYdgW
• Viaduct collapses and leaves at least one dead in Belo Horizonte, in addition to several injured people. http://t.co/BMTZuuCUN2
• Parents of player Khedira are robbed in Recife, says German newspaper. http://t.co/52FnDzKAW2
• #FolhaInTheWorldCup Protester plays with ball during police siege to a protest in Porto Alegre. http://t.co/6W4U4tqDbY
• Lula [former Brazilian president] talks about increasing political action if Dilma [Brazilian president at that time] is reelected. http://t.co/xSXFIV6uHl

Logistic regression

Expected: everyday news, found: politics
• Senator Aloysio Nunes does not dismiss the idea of being vice candidate of Aécio Neves [pre-candidate to presidency]. http://t.co/qyCQMbhrKG
• Subway workers reject counter offer and should block trains this Thursday in SP [Brazilian state]. http://t.co/emu63AYHfx
• After fire in CTG [community center], gay wedding is celebrated at RS [Brazilian state] court: http://t.co/hWIovrnDoW
• Alckmin [São Paulo state governor] will sanction this week new law to forbid masks in protests. http://t.co/EkMmwM7jHk
• Franca [city] assembles a catalog of museum collection to be available on the Internet. http://t.co/2gg8bFTLkd
• Doctor ’expelled’ from SUS [public health system] writes in a book his daily routine in the system: http://t.co/IOpMk3xxbJ
• House of Representatives approves project that equalizes pharmacies and health facilities. http://t.co/ZP3bxhDCMk
• Artifacts found with activist arrested in protest against the World Cup are not explosives, says report. http://t.co/xEKqz4mwhA


Summing up

Analysis of the work

Main finding: it is possible to classify messages into six different classes with an accuracy of nearly 50%, knowing only which users shared each message
• The structure of an information diffusion process contains relevant information about the content of the messages being exchanged in the network


Analysis of the work

• This method may be used to assist traditional content identification strategies (e.g., NLP) in challenging contexts, or be applied in scenarios with limited access to information
• This kind of technique can be applied to the development of minimally invasive classifiers, able to organize even encrypted data
• It is independent of language
• Risk to privacy: aspects of a message’s content can be inferred solely from OSN metadata


Limitations and future work

• The target users are fixed in the training phase
  • How can we extend this result to users not seen in training?
• Focusing on only one source may limit the generality of the results presented here
• Further work may explore multi-label classification, using schemes that allow multiple classes for each message


References

[1] S. Baba, F. Toriumi, T. Sakaki, K. Shinoda, S. Kurihara, K. Kazama, and I. Noda. Classification method for shared information on Twitter without text data. In Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion, pages 1173–1178, New York, USA, 2015. ACM Press.

[2] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining - MDMKDD ’10, pages 1–10, New York, USA, 2010. ACM Press.

[3] X. Hu, J. Tang, H. Gao, and H. Liu. Unsupervised sentiment analysis with emotional signals. In International Conference on World Wide Web, pages 607–617, Rio de Janeiro, Brazil, May 2013. International World Wide Web Conferences Steering Committee.

[4] A. Reyes, P. Rosso, and T. Veale. A multidimensional approach for detecting irony in Twitter. Language Resources and Evaluation, 47(1):239–268, July 2012.

[5] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network. In 2010 IEEE Second International Conference on Social Computing, pages 177–184. IEEE, Aug. 2010.
