Measuring the Networked Public – Exploring Network Science Methods for Large Scale Online Media Studies

Felix Victor Münch Bachelor of Science, Master of Arts

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Digital Media Research Centre Creative Industries Faculty Queensland University of Technology, Australia 2019

Except where otherwise noted, this work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. Copyright 2019 Felix Victor Münch

To free education

Keywords Australia, abduction, big data, community, community detection, complex contagion, complex systems, complexity, contagion, gatekeeping, gatewatching, information diffusion, interdisciplinary research, issue publics, keyword extraction, mass communication, mixed methods, modularity maximisation, network analysis, networks, network science, social media, social networks, topology, online media, organised complexity, patterns, pragmatism, publics, public sphere, public sphericules, stochastic block models, Twitter, Twittersphere


Abstract

Media and communication scholars and social scientists have always, at least implicitly, developed theories about the structures and dynamics of networks. Today, the internet, online news, and social media not only make these networks appear more visible, measurable, and complex, but their ubiquity also creates a need to deal explicitly with the network paradigm – in theory, in empirical research, and in practice. However, while the emerging field of network science should be one of the most important auxiliary disciplines for media and communication studies, its take-up is hindered by the epistemological, teleological, methodological, and educational differences between the two fields. To understand these barriers, this project concurrently explores the methods of network science and the theory of media and communication studies in order to address questions about the structure and dynamics of public communication on both a national and a global scale. It does so in two studies, each employing big social data under a network paradigm. One study focuses on the diffusion of media items on Twitter around acute events, and on concepts of virality and contagion; the other focuses on the macro- and microstructures of the Australian Twitter follow network, on community detection algorithms, and on theoretical constructs of macro- and micro-publics – including the controversial notion of echo chambers. Thereby, this project leads to new, detailed, empirical evidence about the structure of communities and publics on Twitter, and about the dynamics of information diffusion within these networks. In doing so, it reveals epistemological implications of the use of network analysis algorithms, and outlines a methodological framework to establish the bridgeheads necessary to better connect media and communication studies with network science. The project’s two main outcomes are therefore: 1) the implementation of new empirical procedures; and 2) a theoretical and epistemological framework that enables a researcher to develop, validate, and extend theory about a networked public sphere.


Acknowledgements

Doing a PhD is a once-in-a-lifetime experience. You learn a lot, you gain a lot, you lose a lot. It is a chapter in one’s life that only a few are privileged enough to write. And while at times it seems like a very lonely journey, it is supported and enabled by many. This support is the most important privilege of all. It cannot be taken for granted, and it cannot be acknowledged enough. Whenever I lost my way on this years-long journey, literally half a world away from home, I reminded myself of the people helping me, cheering me on, and sometimes suffering with me, before and during this project. It was the most powerful motivator to keep me going. Therefore, close to the end of this journey, and rightfully at the very beginning of this thesis, it is time to say thank you.

Family first. I want to thank my mother, Eva Barbara Münch, who taught me that our most precious treasure lies in our head, in the form of knowledge, experience, and wisdom. When I was doubtful about even starting this project, she helped me to overcome my scepticism. I want to thank my sister, Idunnu Anshelma Münch, who, despite my being an often-absent brother, never seems to question our loyalty, and who was the first person to visit me in Australia. I want to thank Frederick Flegel, my late-encountered younger brother, for reaching out and helping me reflect on my life goals. I want to thank my grandfather, Helmut Münch, who asked, every single time I called, whether I was a Doctor yet. I promise, I will visit more often now. I give the same promise to my great-aunt, Else Giersberg – especially because I deeply regret that I cannot visit her husband, my great-uncle Winfried Giersberg, anymore. I want to thank them both for their always-open house, which is one of the ‘happy places’ in my life. Even though it is awfully materialistic, I also have to thank them for their generosity. The PhD life is rich in experience and knowledge, but not in monetary gain.


Thanks for helping out when necessary.

My friends and colleagues. I’m afraid it is unavoidable (and I am sorry in advance) that not everybody who should be mentioned here is. If you are not mentioned, it does not indicate anything other than the fact that (ironically!) I am not very good at keeping track of my own social networks over time. I want to thank Brenda Moon for being a mentor in matters of good programming practice, an experienced advisor in life-related questions, and an awesome travelling companion. Thanks to Peta Mitchell for her support and collaboration, which helped me understand the differences in disciplines that motivated my study’s objectives. Thanks to Ehsan Dehghan for being the first reader of my chapter drafts, for our most inspiring conversations, and for giving me the confidence that my thesis was useful – at least for him. Thanks to Silvia Montaña, for filling our office and my heart with joy with her energetic temperament. Thanks to Patrick Sobecke, who made my life a lot easier during my first time in Brisbane. Thanks to Ricardo Candeias, for being my first friend in Australia, an animating flat mate, and a trustworthy listener. Thanks to Umut Tasli, for the same reasons. (I hope we will meet again some sunny day.) Thanks to Meg Jing Zeng, for being the best neighbour and colleague that I could ever wish for. Without Meg, finding my place in Brisbane would have been much harder. Thanks to Michael Martin for so openly offering his friendship, and for helping me to put things into perspective. I look forward to seeing you both more often, now that you are based in Europe. Thanks to Michael’s parents, for being such generous hosts. Thanks to Ariadna Matamoros Fernández for our collaboration and for being such a good companion and conversation partner. We should have had more drinks! The same holds true for Sara Ekberg: I thank her for being there when a drink with a friend was necessary.
Thanks to Monique Mann for her friendship and the continuous reminders that a good PhD thesis is a finished PhD thesis. Thanks to Kelly Lewis for being an inspiring, passionate person, and for providing emotional support. Thanks to Malte Mertz, Stanley Kröger, Chris Werlin, Silke Weber, and Jana Gioia Baurmann for giving me shelter and support back in Hamburg during some of the darkest times of the last four years. Thanks to Heike Bormuth for the start of better times in Hamburg. Thanks to Teresa Geisler, for still being so open-hearted, and for still having an almost clairvoyant understanding of my inner workings after all these years. And thanks to Dominik Thalmeier for being who he is: the most reliable, trustworthy, best friend I can imagine, despite every distance life puts between us.

Naturally, I owe gratitude to, and want to thank, my supervisors: Axel Bruns, for providing a most inspiring, open-ended, free (yet funded) and secure environment in which this project could grow; and Patrik Wikström, for always providing an open ear and keeping a critical stance, while at the same time being understanding, friendly, and supportive. I also want to thank Cornelius Puschmann for his external supervision of this project during an emergency visit to Germany. I see him as one of my informal mentors, together with Luca Rossi – grazie di tutto (thanks for everything)! Many thanks also to Christoph Neuberger. Without his recommendation to Axel Bruns, I would not be writing these lines.

I want to thank the organisers of the CCI Digital Methods Summer School (2014) for giving me the first opportunity to present my ideas in front of an international academic audience. Thanks also to the Oxford Internet Institute (OII), especially Vicki Nash, Taha Yasseri, and Scott Hale. The OII Summer Doctoral Programme and its participants provided me with the motivation I needed to get back on track after half a year of hardship and concerns for my partner’s health. Especially during this time, the Hans Bredow Institute (HBI) in Hamburg gave me an academic haven as a visiting researcher. I want to particularly mention (besides Cornelius, above) Lisa Merten, Christiane Matzen, Wiebke Loosen, and Jan-Hinrik Schmidt, and to thank them for their organisational, professional, and social support. I look forward to continuing my work as a research fellow at the HBI.

At QUT, I want to thank the members of my Confirmation and Final Seminar panels, Tomasz Bednarz, Richi Nayak, and (especially) Jean Burgess (who was an observer and companion of this project from the very beginning), for their constructive and guiding feedback on this project in its respective stages.
Furthermore, I want to thank the friendly and helpful staff of the Creative Industries Higher Degree Research team, who helped me all too many times to navigate the complications that my life inflicted on the progress of my research. After I got this research down on paper, my editor, Denise Scott, despite an unlucky accident, endured a literally painstaking marathon to straighten out the remaining Germanisms that still shine through in my written English at times. Thanks for that. Also, I want to thank the as-yet-anonymous examiners for their constructive and instructive feedback.

Looking further back, I want to thank the German Academic Scholarship Foundation (Studienstiftung des deutschen Volkes) for supporting me with a scholarship during my Bachelor and Master courses in Germany. However, I think I would not have made it even that far without the high-quality public education system in Germany, which is (mostly) free of tuition fees and other costs, and is financed by the German taxpayer. I am grateful to have grown up in a country that (still) considers education to be a right and not a commodity. I hope that it stays that way, and that it will be an example for many more countries to follow.

Finally, I want to apologise: I’m sorry, Nadine, for how much of our precious time this project has taken. Even if I have failed to make you understand and feel it, you’ve been the axis of my life during these years. Thanks for your patience, your kindness, and for the good times we had. I hope you’ll have many more.


This thesis was written with an adapted template for writing a PhD thesis with Pandoc in Markdown (Pollard et al., 2016). Many thanks to all contributors to this and all related open-source projects. Technically, the research was supported by QRIScloud and by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS). Financially, this project was supported by the Australian Research Council through the ARC Future Fellowship project Understanding Intermedia Information Flows in the Australian Online Public Sphere, by the ARC LIEF project TrISMA: Tracking Infrastructure for Social Media Analysis (LIEF grant LE140100148), by a QUT Postgraduate Research Award, and by a QUT Excellence Top Up Scholarship.


Table of Contents

Keywords . . . i
Abstract . . . ii
Acknowledgements . . . iv
Abbreviations . . . xxiv
Statement of original authorship . . . xxvi

1 Introduction . . . 1
  1.1 Background, Project Goals, and Relevance . . . 2
  1.2 On Interdisciplinarity . . . 5
  1.3 Chapter Structure and Summary . . . 7

2 Literature Review . . . 11
  2.1 Recent Challenges in Public Communication . . . 12
    2.1.1 Fake News . . . 13
    2.1.2 Filter Bubbles and Echo Chambers . . . 15
    2.1.3 The Networked Public Sphere . . . 17
  2.2 Social Media as a Source of Evidence . . . 19
    2.2.1 Research on Twitter . . . 20
    2.2.2 Research on Facebook . . . 28
    2.2.3 Summary . . . 30
  2.3 Research Challenges for Media and Communication Studies . . . 31
  2.4 The Media and Communication Studies Perspective . . . 35
    2.4.1 The Rise of Networked Public Spheres . . . 35
    2.4.2 Theories about Network Structures in the Public Sphere . . . 41
      2.4.2.1 Diffusion of News and Information . . . 41
      2.4.2.2 The Perceived Audience: From Undifferentiated Masses to Ad Hoc Issue Publics and Communities . . . 43
      2.4.2.3 Gatekeeping . . . 48
      2.4.2.4 One-, Two-, and Multi-Step Flows . . . 52
  2.5 The Network Science Perspective . . . 55
    2.5.1 A Brief History of Network Science . . . 55
    2.5.2 Drawing the Baseline: Understanding and Modelling the Growth of Networks and their Properties . . . 62
    2.5.3 Community Detection . . . 67
    2.5.4 How Things ‘Go Viral’: Contagion on Networks . . . 69
      2.5.4.1 Modelling of Contagion . . . 70
      2.5.4.2 Empirical Analysis of Contagion in Online Media Environments . . . 76
      2.5.4.3 Prediction of Contagion in Online Media Environments . . . 78
    2.5.5 Some Steps Further: Multilayer Networks . . . 79
  2.6 Conclusion . . . 80

3 Objectives . . . 85

4 Methodology and Research Design . . . 88
  4.1 Methodology . . . 88
    4.1.1 Conflicting Epistemologies, Conflicting Methods? . . . 89
    4.1.2 Organised Complexity, the Computational Turn, and Big Data . . . 91
    4.1.3 Patterns, Deduction, Induction, and Abduction . . . 96
    4.1.4 Pragmatism and Mixed Methods as a Necessity . . . 100
  4.2 Research Design . . . 103
    4.2.1 General approach . . . 104
    4.2.2 Data . . . 105
    4.2.3 Research Cycle . . . 107
    4.2.4 Research Design . . . 110
      4.2.4.1 Stage 1: Interpretation of Theory . . . 111
      4.2.4.2 Stage 2: Exploration of Network Science Methods . . . 111
      4.2.4.3 Stage 3: Reflection with Theory . . . 114
      4.2.4.4 Stage 4: Evaluation . . . 115
    4.2.5 Limitations . . . 115
      4.2.5.1 Design . . . 115
      4.2.5.2 Data . . . 116
    4.2.6 Ethical Considerations . . . 117

5 Study 1: Measuring Communication Cascades . . . 122
  5.1 Objectives . . . 123
  5.2 Background . . . 124
  5.3 Cases and Data Description . . . 127
    5.3.1 #sydneysiege and #illridewithyou . . . 127
      5.3.1.1 Event . . . 127
      5.3.1.2 Data Description . . . 128
    5.3.2 Brexit Repeat Referendum . . . 133
      5.3.2.1 Event . . . 133
      5.3.2.2 Data Description . . . 133
  5.4 Analysis . . . 136
    5.4.1 New and Returning Accounts over Time . . . 138
      5.4.1.1 Method and Results . . . 138
      5.4.1.2 Discussion . . . 140
    5.4.2 Analysis of Diffusion Trees . . . 141
      5.4.2.1 Reconstruction of Diffusion Tree Network . . . 141
      5.4.2.2 Connected Components . . . 145
      5.4.2.3 Closeness Centrality Distributions . . . 151
      5.4.2.4 Structural Virality . . . 163
    5.4.3 Exposure Analysis over Time, Complex and Simple Contagion . . . 171
      5.4.3.1 Exposure Network . . . 172
      5.4.3.2 Typical Exposure Times . . . 173
      5.4.3.3 Number of Exposures . . . 180
      5.4.3.4 Influence Network . . . 191
    5.4.4 Summary . . . 196
  5.5 Discussion of Theoretical Implications . . . 198
  5.6 Outlook . . . 201
  5.7 Conclusion . . . 204

6 Study 2: Mappings of a Public Sphere . . . 208
  6.1 Objectives . . . 209
  6.2 Background . . . 210
  6.3 Data Description . . . 213
  6.4 Analysis . . . 214
    6.4.1 Community Detection in the Full Australian Network, Based on Modularity . . . 215
      6.4.1.1 Modularity and Parallel Louvain Method . . . 215
      6.4.1.2 Comparison with Communities in Reduced Follow Network . . . 218
      6.4.1.3 Filtering by k-Cores . . . 222
      6.4.1.4 Activity Analysis . . . 223
      6.4.1.5 Keyword Extraction . . . 226
      6.4.1.6 Mapping . . . 237
      6.4.1.7 Summary . . . 244
    6.4.2 Community Detection Based on Stochastic Block Models . . . 246
      6.4.2.1 Nested Stochastic Block Model . . . 248
      6.4.2.2 Model Inference and Minimum Description Length . . . 250
      6.4.2.3 Keyword Extraction . . . 252
      6.4.2.4 Result . . . 252
      6.4.2.5 Summary . . . 277
  6.5 Discussion . . . 278
    6.5.1 Methods and Results . . . 278
    6.5.2 Theoretical Implications . . . 280
  6.6 Outlook . . . 283
  6.7 Conclusion . . . 285

7 Concluding Discussion . . . 290
  7.1 Question 1: Networking Theory . . . 291
  7.2 Question 2: Applying Network Science . . . 294
  7.3 Question 3: But Why? . . . 302
    7.3.1 Study 1 . . . 303
    7.3.2 Study 2 . . . 307
    7.3.3 Summary . . . 311
  7.4 Conclusion and Outlook . . . 312

References . . . 318

List of Figures 2.1

2016 Australian Twittersphere network. . . . . . . . . . . . . . . . . . .

2.2

Overview of six characteristic macroscopic network structures in Twitter communication networks . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.3

21

23

Flow diagram of the quantitative classification of six macroscopic communication network structures . . . . . . . . . . . . . . . . . . . . . . .

24

2.4

Four stages of audience fragmentation . . . . . . . . . . . . . . . . . . .

38

2.5

The long tail distribution of audience size and engagement according to Bruns (2008, p. 70) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.6

39

An artistic depiction of the model of the public sphere by Bruns (2008, p. 71) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

40

2.7

Schematic of a news diffusion network . . . . . . . . . . . . . . . . . . .

44

2.8

Schematic of perceptions of the audience in a broadcast mass media environment before the emergence of networked mass media . . . . . . .

2.9

45

Schematic illustrating the differentiation between community and issue public . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

2.10 Schematic of a gatekeeping network . . . . . . . . . . . . . . . . . . . . .

49

2.11 Schematic of the gatewatching process . . . . . . . . . . . . . . . . . . .

50

2.12 Schematic of a network gatekeeper in form of a bridge between clusters

51

2.13 Schematic of the one-step and two-step flow hypothesis . . . . . . . . . .

53

2.14 The degree distribution in a power-law network compared to a random network degree distribution . . . . . . . . . . . . . . . . . . . . . . . . .

64

4.1

Research cycle guiding each test study . . . . . . . . . . . . . . . . . . . 108

4.2

Overview of the research design . . . . . . . . . . . . . . . . . . . . . . . 110

5.1

Depiction of a pure broadcast and structurally more viral cascade . . . . 125

xiv

5.2

Running sum of the number of collected tweets containing the hashtag sydneysiege . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

5.3

Running sum of the number of collected tweets containing the hashtag illridewithyou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5.4

Running sum of new accounts using the hashtag sydneysiege

. . . . . . 131

5.5

Running sum of new accounts using the hashtag illridewithyou . . . . . 132

5.6

Running sum of collected tweets containing the link to the petition . . . 134

5.7

Running sum of collected tweets containing the link to the petition recorded during the week after 24 June 2016 . . . . . . . . . . . . . . . . 135

5.8

Running sum of collected tweets containing the link to the petition recorded on 24 June 2016. . . . . . . . . . . . . . . . . . . . . . . . . . . 135

5.9

Screenshot of share buttons on the petition website . . . . . . . . . . . . 136

5.10 New vs returning users using the hashtag sydneysiege . . . . . . . . . . 138 5.11 New vs returning users using the hashtag illridewithyou . . . . . . . . . 139 5.12 Moving average over 60 minutes of new vs returning users tweeting the link to the petition per minute. . . . . . . . . . . . . . . . . . . . . . . . 139 5.13 Depiction of Graph Database Schema for Twitter Data . . . . . . . . . . 143 5.14 Force-directed visualisation of the diffusion tree network of the hashtag illridewithyou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.15 Percentages covered by weakly-connected components of the diffusion tree network for the hashtag illridewithyou . . . . . . . . . . . . . . . . . 146 5.16 Force-directed visualisation of the diffusion tree network of the hashtag sydneysiege . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 5.17 Percentages covered by weakly-connected components of the diffusion tree network for the hashtag sydneysiege . . . . . . . . . . . . . . . . . . 147 5.18 Force-directed visualisation of the diffusion tree network of the petition link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 5.19 Percentages covered by weakly- connected components of the diffusion tree network for the link to the petition . . . . . . . . . . . . . . . . . . 148 5.20 Visualisation of the diffusion network of the hashtag illridewithyou, coloured by the undirected harmonic closeness centrality . . . . . . . . . 153

xv

5.21 Visualisation of the diffusion network of the hashtag sydneysiege, coloured by the undirected harmonic closeness centrality . . . . . . . . . 154 5.22 Visualisation of the diffusion network of the petition link, coloured by the undirected harmonic closeness centrality . . . . . . . . . . . . . . . . 155 5.23 Distribution of harmonic closeness in the undirected diffusion tree of the hashtag illridewithyou . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5.24 Distribution of harmonic closeness in the undirected diffusion tree of the hashtag sydneysiege . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 5.25 Distribution of harmonic closeness in the undirected diffusion tree of the link to the petition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 5.26 Visualisation of the diffusion network of the hashtag illridewithyou, coloured by the directed harmonic closeness centrality . . . . . . . . . . 158 5.27 Visualisation of the diffusion network of the hashtag sydneysiege, coloured by the directed harmonic closeness centrality . . . . . . . . . . 159 5.28 Visualisation of the diffusion network of the petition link, coloured by the directed harmonic closeness centrality . . . . . . . . . . . . . . . . . 160 5.29 Distribution of harmonic closeness in the directed diffusion tree of the hashtag illridewithyou . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.30 Distribution of harmonic closeness in the directed diffusion tree of the hashtag sydneysiege . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 5.31 Distribution of harmonic closeness in the directed diffusion tree of the link to the petition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 5.32 Histogram of shortest path lengths in the diffusion tree of the hashtag illridewithyou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
165 5.33 Histogram of shortest path lengths in the diffusion tree of the hashtag sydneysiege . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 5.34 Histogram of shortest path lengths in the diffusion tree of the link to the petition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 5.35 Scatter plot of the average shortest path length per component in relation to component size in the diffusion tree of the hashtag illridewithyou . . 167 5.36 Boxplot showing the distribution of the average shortest path length per connected component of the illridewithyou diffusion tree . . . . . . . . . 167 xvi

5.37 Scatter plot of the average shortest path length per component in relation to component size in the diffusion tree of the hashtag sydneysiege . . . . 168 5.38 Boxplot showing the distribution of the average shortest path length per connected component of the sydneysiege diffusion tree . . . . . . . . . . 168 5.39 Scatter plot of the average shortest path length per component in relation to component size in the diffusion tree of the petition link . . . . . . . . 169 5.40 Boxplot showing the distribution of the average shortest path length per connected component of the petition link diffusion tree . . . . . . . . . . 169 5.41 Example of a bimodal exposure network . . . . . . . . . . . . . . . . . . 173 5.42 Ranked time differences between exposures to the hashtag illridewithyou and the first own post by an account . . . . . . . . . . . . . . . . . . . . 174 5.43 Ranked time differences between exposures to the hashtag sydneysiege and the first own post by an account . . . . . . . . . . . . . . . . . . . . 174 5.44 Ranked time differences between exposures to the hashtag illridewithyou and the first own post by an account . . . . . . . . . . . . . . . . . . . . 175 5.45 Heatmap showing the number of accounts having used the hashtag illridewithyou, mapped according to their first and last possible exposure to the hashtag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.46 Heatmap showing the number of accounts having used the hashtag sydneysiege, mapped according to their first and last possible exposure to the hashtag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 5.47 Heatmap showing the number of accounts having tweeted the link to the petition, mapped according to their first and last possible exposure to the link, for the first 10 000 accounts only . . . . . . . . . . . . . . . . . 
177 5.48 Heatmap showing the number of accounts having tweeted the link to the petition, mapped according to their first and last possible exposure to the link, for all accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 5.49 Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag illridewithyou before tweeting it themselves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

xvii

5.50 Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag sydneysiege before tweeting it themselves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 5.51 Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition before tweeting it themselves, limited to the first 10 000 accounts . . . . . . . . . . . . . . . . . 181 5.52 Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition before tweeting it themselves, for all accounts in the dataset . . . . . . . . . . . . . . . . . . . . 182 5.53 Heatmap showing the correlation of number of friends (followings) with the number of exposures, in the case of the hashtag illridewithyou

. . . 183

5.54 Heatmap showing the correlation of number of friends (followings) with the number of exposures, in the case of the hashtag sydneysiege . . . . . 183 5.55 Heatmap showing the correlation of number of friends (followings) with the number of exposures in the case of the link to the petition, for the first 10 000 accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 5.56 Heatmap showing the correlation of number of friends (followings) with the number of exposures in the case of the link to the petition, for all accounts in the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 5.57 Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag illridewithyou per followed account before the respective accounts tweet the hashtag, limited to the first 10 000 accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 5.58 Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag sydneysiege per followed account before the respective accounts tweet the hashtag, limited to the first 10 000 accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 5.59 Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition per followed account before the respective accounts tweet the link, limited to the first 10 000 accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186


5.60 Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition per followed account before the respective accounts tweet the link, for all accounts in the dataset . . . 187
5.61 Distributions of the logarithm of exposures per friend before tweeting the hashtags illridewithyou and sydneysiege . . . 188
5.62 Heatmap showing the distribution of infections (adoptions) by exposures per friend over time for the hashtag illridewithyou. Time in seconds of UNIX time (epoch). . . . 189
5.63 Heatmap showing the distribution of infections (adoptions) by exposures per friend over time for the hashtag sydneysiege. Time in seconds of UNIX time (epoch). . . . 189
5.64 Heatmap showing the distribution of infections by exposures per friend over time for the link to the petition . . . 190
5.65 Visualisation of the influence network for the hashtag illridewithyou . . . 192
5.66 Visualisation of the influence network for the hashtag sydneysiege . . . 193
5.67 Visualisation of the influence network for the petition link . . . 194
6.1 Map of Australian Twittersphere by Bruns et al. (2017) with Louvain Resolution 0.5 . . . 217

6.2 Map of Australian Twittersphere by Bruns et al. (2017) with Louvain Resolution 0.25 . . . 218

6.3 Adjusted Rand index comparing the partitioning with the parallel Louvain method of the full network, with the partitioning with the serial Louvain method of the network filtered for accounts with degree over 1000 (as in Bruns et al. (2017)) . . . 220

6.4 Number of accounts in clusters labelled by Bruns et al. (2017) (# of accounts old, top), and the fraction of overlap with the overlapping cluster determined with the PLM method for the full follow network (bottom) . . . 221
6.5 Linear and log-log plot of size distribution of communities detected by PLM algorithm, with γ = 2 . . . 222


6.6 Distribution of k-core sizes relative to k for a suspected bot-network (community 12) and a typical cluster (community 13); numbers in brackets specify number of nodes in the respective k-core . . . 223
6.7 Overview of posting activities of the community cores from 10 to 19 Feb 2017 . . . 224
6.8 Distribution of account activities for a suspected bot-network (12) and a typical cluster (13) . . . 225
6.9 Top 60 keywords for community-core 8 (pop culture) by χ² score when compared to all other cores and fraction of accounts using them in all communities analysed . . . 231

6.10 Top 60 keywords for community-core 8 (pop culture) by χ² score when compared to all other cores and fraction of accounts using them in all communities analysed . . . 232
6.11 Top 60 keywords for community-core 32 (hard right) by χ² score when compared to all other cores and fraction of accounts using them in all communities analysed . . . 233
6.12 Top 60 keywords for community-core 5 (Australian politics) by χ² score when compared to all other cores and fraction of accounts using them in all communities analysed . . . 235
6.13 Force-directed visualisation of the undirected community graph, filtered for an edge-weight of 10 000 connections . . . 237
6.14 Force-directed visualisation of the undirected community graph, filtered for an edge-weight of 100 000 connections . . . 238
6.15 2016 Australian Twittersphere network . . . 239
6.16 Top 60 keywords for community-core 14 (agriculture) by χ² score, compared to all other cores, and fraction of accounts using them in all communities analysed . . . 240
6.17 Visualisation of the adjacency matrix of the undirected community graph, as determined with the PLM algorithm, adapted for better discernibility of the connectivity of smaller communities . . . 242
6.18 Visualisation of the detected hierarchical block model structure in the Internet Movie Database (IMDb) actor-movie network by Peixoto (2014b) . . . 247

6.19 Example of a network generated with stochastic block model . . . 248
6.20 Example of a nested SBM . . . 249
6.21 Visualisation of the nested stochastic block model inferred for the filtered Australian follow network, edges sampled down to 0.1 percent . . . 254
6.22 Annotated visualisation of the nested stochastic block model inferred for the filtered Australian follow network, edges sampled down to 1000 edges . . . 255
6.23 Visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 5, as inferred based on nested stochastic block models . . . 257
6.24 Treemap of blocks on levels 4 to 5, as inferred based on nested stochastic block models in the filtered Australian follow network . . . 258
6.25 Visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 4, as inferred based on nested stochastic block models . . . 260
6.26 Treemap of blocks on levels 3 to 5, as inferred based on nested stochastic block models in the filtered Australian follow network . . . 261
6.27 Visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 3, as inferred based on nested stochastic block models . . . 262
6.28 Force-directed visualisation of the directed block-network of the Australian Twittersphere on level 3 of the inferred nested stochastic block model . . . 263
6.29 Treemap of blocks on levels 2 to 5, as inferred based on nested stochastic block models in the filtered Australian follow network . . . 264
6.30 Detail of blocks at level 2 in parent block 0/0/6 (Australian Politics) . . . 265
6.31 Visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 2, as inferred based on nested stochastic block models . . . 266
6.32 Annotated, force-directed visualisation of the directed block-network of the Australian Twittersphere on level 2 of the inferred nested stochastic block model . . . 267


6.33 Detail of blocks at level 2 in parent block 0/1 (exhibiting peripheral blocks and topics) . . . 267
6.34 Treemap of blocks on levels 1 to 5, as inferred based on nested stochastic block models in the filtered Australian follow network . . . 269
6.35 Image section of blocks on level 1 in parent block 0/0/2/2 (sports) . . . 271
6.36 Detail of blocks at level 1 in parent block 0/0/6/23 (Australian politics) . . . 272
6.37 Annotated visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 1, as inferred based on nested stochastic block models . . . 273
6.38 Treemap of blocks on levels 0 to 5, as inferred based on nested stochastic block models in the filtered Australian follow network . . . 274
6.39 Detail of blocks at level 0 in block 0/0/2/2 (sports) . . . 276
6.40 Visualisation of the adjacency matrix of the directed block graph of the filtered Australian follow network on level 0, as inferred based on nested stochastic block models . . . 277
7.1 Schematic of a pragmatic methodological framework for interdisciplinary research . . . 295


List of Tables

5.1 Component centralisations Γ and Γ2 for the diffusion trees of illridewithyou, sydneysiege, and the petition link . . . 150

5.2 Structural virality (ignoring impossible paths), average degree, diameter, and number of connected components of the undirected diffusion tree network . . . 164
5.3 Selected network measures of the influence network for all three cases . . . 194
5.4 Summary of results regarding the diffusion of the hashtags and the petition link . . . 197

6.1 Top 60 Keywords by chi-squared score in English writing communities with over 1000 accounts, filtered for usage by at least 5% of the 10%-core . . . 228
6.2 Number of blocks per level in the nested stochastic block model inferred for the filtered Australian follow network (254,585 nodes) . . . 253
6.3 Top 30 Keywords by chi-squared score in blocks at level 5, filtered for usage by at least 5% of active accounts (Thai script omitted) . . . 256


Abbreviations

AEDT Australian Eastern Daylight Time
AFL Australian Football League
ANT Actor Network Theory
AoIR Association of Internet Researchers
API Application Programming Interface
BST British Summer Time
DMRC Digital Media Research Centre
EU European Union
IaaS Infrastructure as a Service
IMDb Internet Movie Database
IRC Internet Relay Chat
MDL Minimum Description Length
Nectar National eResearch Collaboration Tools and Resources project
NRL National Rugby League
OII Oxford Internet Institute
PLM Parallel Louvain Method
QUT Queensland University of Technology
SBM stochastic block model
SMRG Social Media Research Group
DMI-TCAT Digital Methods Initiative Twitter Capture and Analysis Toolset
TrISMA Tracking Infrastructure for Social Media Analysis
UK United Kingdom
URL Uniform Resource Locator
USA United States of America
UTC Coordinated Universal Time


Statement of original authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

QUT Verified Signature

January 2019


Chapter 1

Introduction

Before I started this project in 2014, the Internet and social media still seemed to me to be a ‘Robin Hood’ story: taking from the powerful establishment, and disrupting structures in favour of the marginalised. This feeling of empowerment was driven by phenomena such as Wikileaks, the Arab Spring, blogs, and social media. These alternative media kept the established media in check, if the latter failed to control those in power. Admittedly, this feeling was also biased, as it was based on my personal experience: the successful use of social media as an important tool in organising student protests against the introduction of student fees in Bavaria (my home state in Germany) between 2007 and 2010.

Between 2014 and 2018, during the course of this project, however, this ‘Robin Hood’ picture changed drastically: Brexit was characterised by polarised online debate where (seemingly) nobody listened to anyone else; Donald Trump was carried into the White House by a hard-right movement that organised itself mainly online; hard-right, sometimes openly racist parties were successfully elected into parliaments throughout Europe; and hate-speech, due to the policy discussions it fosters, now threatens freedom of expression.

In media and academic discourse, this new context sparked discussions about concepts such as fake news, filter bubbles, echo chambers, the affordances of social media platforms, and their failure to govern themselves in a way that prevents the diffusion of verbal and visual abuse of vulnerable groups in our society. During these discussions, I regretted that this project was not yet finished as, from the network science perspective, I had the impression that these discussions were actually often about complex network structures and dynamics. Too often, however, as a community of media and communication scholars, we lack the vocabulary and the methods necessary to analyse and discuss the networks relevant to these discussions at the scale and complexity needed; that is, at the scale and complexity of society itself, and at the scale and complexity of a networked public sphere.

While I was exploring network science to find these necessary methods and this necessary vocabulary, I noticed that it was hard to talk in the same language that media and communication scholars usually use. This raised an important question: Why? Why are we not using the vocabulary of network science? Before we can sustainably and successfully merge network science methods with the theory and discourse of media and communication studies, we have to find better ways to connect the different cultures in all participating fields. Therefore, the dual aim of this project is: to help to build a bridge between the research cultures involved and, at the same time, to provide knowledge, theory, and methods that are mutually valuable and useful.

1.1 Background, Project Goals, and Relevance

As a result of online communication, mainstream mass media are losing their structural advantages. Non-media organisations, brands, celebrities, politicians, and ordinary users are now empowered to publish their own content, to build their own spheres, and to decide where, and how, information flows. In these times of network-based, real-time, many-to-many communication, models of the public sphere should reflect the rise of flexible public, semi-public, and private sub-spheres. While there are hitherto unthinkable amounts of accessible data (which, I acknowledge, come with their own complications) through which to examine these spheres, social sciences and communication studies have not yet adequately adopted these new analytical possibilities. This holds especially true for network analysis methods.

To address this deficit, this project approached public communication from a network science perspective, with a focus on the Australian Twittersphere. Obviously, Twitter does not represent the public sphere; however, as I later elaborate, it is a flexible, complex, real-world example of information dissemination in human communication
networks. At the same time, the project takes further steps towards a theoretical and methodological framework to find a successor – a networked public sphere model – to replace the long-dominant model of a homogeneous public sphere where, simply put, public communication consists merely of a mediated discourse among elites in front of a (more-or-less) non-responsive, rather silent audience (Habermas, 1962).

Thereby, the project focuses primarily on a methodological and theoretical outcome, rather than on a research question in the media and communication studies field. Research questions, established theories, and theory development are, however, used to validate the usefulness of the methods developed. To this end, the project follows a multi-phase, mixed-method design. In this design, methods are not only considered as tools to investigate specific research questions, but also as having a life of their own. A driving idea behind this work is that, while research is often motivated by specific questions, the processes of methodology development, exploration, and testing can themselves raise new questions. Thus, while taking specific research questions as its guides, this research is undertaken on the understanding that, along the way, in an iterative process of trial and error, the important problems will reveal themselves, and methods discovered in this process will raise (and answer) questions that probably would not have been previously considered.

The instrument explored in this thesis is network science – a discipline that I expect will become one of the most important auxiliary disciplines for media and communication studies in the near future. Before the emergence of the Internet as a common way of communicating, mass-media communication networks seemed simple.
They were typically star-shaped: A sender – perhaps a newspaper publisher, a TV or radio station, an activist group, or a publisher of scientific journals – sent out something to a crowd of recipients, who did not directly respond (see, e.g., McQuail, 2010, pp. 540–541). Interconnections among these recipients did exist, and were referred to, for example, as ‘social formations’ (Fiske, 1992). However, despite Katz and Lazarsfeld’s 1940s’ pioneering research (Katz & Lazarsfeld, 1955; Lazarsfeld, Berelson, & Gaudet, 1944), which did undertake an early kind of network analysis, these interconnections were hidden to quantitative research on a bigger scale due to the lack of data or the means to collect them. So, it is understandable that simple aggregate statistics about recipients became the standard for mass-media measures, at least from the perspective
of traditional mainstream media outlets and the public opinion industry (see Dahlgren, 2005, p. 149); they were sufficient to describe the simple network structure. There were social and information networks among both recipients and senders, and these were discussed by scholars even before the 1990s: “The network perspective has proved fruitful in a wide range of social and behavioral science disciplines. Many topics that have traditionally interested social scientists can be thought of in relational or social network analytic terms.” (Wasserman & Faust, 1994, p. 6). However, in mass media the relevant networks between members of society were barely visible, and only quantifiable with great (often unjustifiable) effort.

The Internet made the visible media networks more complex. The emerging “scope of the Internet seems to challenge virtually every element of the mass communication ‘ideal type’ ” (McQuail, 2010, p. 542). Mass communication is “not centre-peripheral in form” anymore, “but networked” (McQuail, 2010, p. 542). Through the observable hyperlink network, the network among the information sources was made manifest. Online social media platforms merged recipients’ social networks into media distribution networks. Thereby, recipients became senders themselves. So, it “is no longer possible to characterize the dominant ‘direction’ or bias of influence of information flows (as with press and television news and comment)” (McQuail, 2010, p. 141). Therefore, network concepts and methods became important for media studies, as media and communication scholars were naturally interested in the emergence of those sometimes semi-closed, yet interconnected public spaces.

Research on those interconnections was, and is, hindered by three factors, however. First, more complicated interconnections led to multi-correlated and non-linear effects, which are difficult, if not impossible, to model with media and communication scholars’ standard methods.
Second, the existence of hitherto unthinkable amounts of data led to a pressure on researchers to use them. They were not used to working with this kind of data and, due to epistemological differences that this study later addresses, this led to a resistance to doing so. Third, the (mostly) privately-owned platforms noticed the monetary worth of their data. For example, after a year-long period of openness, Twitter narrowed its API accessibility and built a commercial barrier that disrupted a thriving community of developers and researchers (see Burgess & Bruns, 2012).
Nevertheless, there is a “strong case for revision of theory” (McQuail, 2010, p. 157). Taking the new interactivity of mass media communication into account:

One of the tasks of future theorizing will be to adequately map this enlarging field and to develop appropriate typologies that will allow us to escape from the limitations of a rather worn-out conceptual apparatus. (McQuail, 2010, p. 547)

As this thesis argues, network science provides the most promising ways of addressing this theoretical task.

1.2 On Interdisciplinarity

Write a natural science introduction for a social science audience and your paper will be rejected before the reviewer sees the results section. Write a social science introduction for a natural science audience and you will be scoffed away for being ‘unnecessarily verbose’. (Hidalgo, 2016, p. 14)

This project is inherently interdisciplinary. Not only does it navigate between media and communication studies and network science: network science itself is a discipline that has emerged in the last two decades from a number of different fields (Barabási, 2015). As I see the disciplinary divides between the fields involved as a problem that should be addressed, this thesis is written for an interdisciplinary audience. It aims to be useful for, and understood by, both media and communication scholars, and the various researchers profiting from, and contributing to, network science. These include natural scientists, social scientists, and economists, for example. Therefore, as far as is reasonable, and to find a middle ground, I aim for a comprehensive, yet not overly detailed, content coverage.

This dual audience also necessitates an extensive literature review, as one cannot assume that both domains have the same knowledge. Also, a basic explanation of different epistemologies, and the conflicts they cause in the media and communication and social science fields, is necessary so that natural scientists can gain an understanding of issues that are fundamental to the project. Sometimes, this also involves the explanation of definitions and concepts in a way that might seem overly verbose to
a reader who is more acquainted with more technical, mathematical, or quantitative research.

Besides knowledge, as the Hidalgo (2016, p. 14) quote illustrates, there are also differences in style. For comprehension’s sake, if it is not absolutely necessary to make the point at hand clear, I sometimes sacrifice a precise definition in the language of formal logic and maths that natural scientists are used to and prefer. Furthermore, it is sometimes necessary to accept a certain degree of unavoidable vagueness when working with questions regarding discursive and associatively defined concepts. In such cases, language becomes more than simply a provider of variable names, and some redundant explanation is needed to (at least) approximate a definition with the precision needed for the scope discussed.

On the other hand, some conventions and practices from the natural sciences are necessary to achieve and communicate the results of this project. These include not only some advanced mathematics, but also the extensive use of figures – such as probability distributions, heatmaps, or network visualisations – as support for the arguments made. These figures might take some time to comprehend, but are crucial to an understanding of the arguments made. This leads to yet another difference in styles between the disciplines:

Social scientists often think of graphical statistical methods as “non-serious,” since they are limited in their ability to control for confounding factors, while natural scientists find that the use of tables, instead of graphical representation of results, occludes information about functional forms, which natural scientists consider important. (Hidalgo, 2016, p. 14)

As networks do not usually exhibit the distributions necessary to apply the usual multivariate inferential statistics, or even common descriptive statistics, I mostly fall back on graphical statistical methods rather than tables, and trust that the advantages of this approach are clear.

In this spirit of interdisciplinarity, the definition of ‘network’ (the first definition of this thesis) is as follows:

A graph [or network] is a collection of points, called vertices (nodes in the physics literature or actors in the social sciences). These points are
joined by a set of connections, called edges, links, and ties, in mathematics, physics, and social sciences, respectively. (Pastor-Satorras, Castellano, Van Mieghem, & Vespignani, 2015, p. 932)

As we can see from this definition, the vocabulary for networks, or graphs, differs for the same concepts depending on the field, or even the author. For the most part, I use these different terms interchangeably throughout this thesis. However, where terms have slightly different implications – for example, when the term ‘actor’ implies some form of agency – I sometimes prefer one term over the other.

In summary, it is my intention that the overall style of this thesis increases both its interdisciplinarity and its potential interdisciplinary readership, and thereby decreases the likelihood of its being seen as relevant to one of the related disciplines only.
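To make this shared vocabulary concrete, the following minimal sketch in plain Python represents a small directed ‘follow’ network. The accounts and ties are invented for illustration only; this is not code from the studies reported later in the thesis.

```python
# Vertices (nodes, actors) are accounts; edges (links, ties) are follow
# relations, here pointing from the follower to the account they follow.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": [],
    "dave":  ["alice", "bob", "carol"],
}

vertices = set(follows)
edges = [(src, dst) for src, dsts in follows.items() for dst in dsts]

# In a directed graph, degree splits by direction:
out_degree = {v: len(follows[v]) for v in vertices}   # followings ('friends' in Twitter's terms)
in_degree = {v: sum(v in dsts for dsts in follows.values())
             for v in vertices}                       # followers

print(len(vertices), len(edges))  # 4 6
print(in_degree["carol"])         # 3 (followed by alice, bob, and dave)
```

At the scale of a national Twittersphere, the same primitives are of course provided by dedicated libraries such as networkx, NetworKit, or graph-tool, rather than hand-rolled dictionaries.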

1.3 Chapter Structure and Summary

Following this introduction, in chapter 2, I lay the basis for the thesis argument that network science should be considered as one of the most important, useful, and necessary auxiliary disciplines in media and communication studies. Starting with so-called ‘fake news’ and the concepts of filter bubbles and echo chambers as examples of recent challenges in public communication, I postulate the necessity for emerging models of a networked public sphere (sec. 2.1). Social media data are a natural source of evidence for these models; therefore, I review a selection of studies that focus on the analysis of network structures on Twitter and Facebook (sec. 2.2). A problem is immediately evident: This corpus of work appears to be scattered among disciplines, without the integrating guidance of a dominating discourse or common goal. Reasons for this can be found in the research challenges for media and communication studies: the organised complexity of the problems themselves; technical difficulties in handling the relevant data; and a historically grown disciplinary divide, not only between media and communication studies and the natural and computational sciences necessary for network science, but also within network science itself (sec. 2.3).

This identified problem determines the two areas chosen for further literature review: media and communication studies that are easily connected with network science; and
the relevant network science literature. The first area relates to media and communication studies that are easily connected with network science (sec. 2.4). The literature review of this area serves two purposes: 1) It gives readers outside this discipline the background necessary to understand the following chapters; and 2) it contains parts of the answer to the first guiding question of this research (discussed in detail in the next chapter). With the help of these examples, I am able to demonstrate ways to interpret established theories, concepts, and hypotheses central to media and communication studies, in the often visual vocabulary of networks.

Following this, I give an overview of the relevant network science literature (sec. 2.5). However, this review goes further than what is necessary to understand the empirical studies in the later chapters. This is because my goal here is also to give a reader who is new to the field (at least) both a broad overview of network science, and some guidelines on how to apply it in their own research.

Overall, this literature review shows how promising this undertaking is in light of the existing research. However, it also indicates the complex barriers between the disciplines. These barriers hinder the sufficient engagement of both bodies of literature that is necessary to build a theory of the networked public sphere based on empirical evidence.

Drawing on this background, chapter 3 outlines and explains the objectives and guiding questions of this project: How to translate media and communications theory about mass communication into the language of networks; how to apply network science methods to work with this theory; and whether, indeed, it is worth the effort. In the literature review, and throughout this project, I identify methodological problems as one of the main reasons why network science is having a slower take-up in media and communication studies than it should.
It is important to note that I use the term ‘methodological’ throughout this thesis, not only to refer to methods themselves, but also to their bases: mainly epistemology and teleology or, simply, the worldview of researchers, and how this view affects their research practice.

In chapter 4, I first outline the main epistemological conflicts relevant to this project, and explain how the so-called ‘computational turn’ in the humanities creates the need for a solution, or at least a truce, in these conflicts. This leads to a central
outcome of this thesis: a methodological framework, based on the concept of patterns, that resolves these conflicts (at least within the scope of this thesis); and the argument that pragmatism is the epistemological standpoint necessary to make further progress towards this project’s goals (sec. 4.1). In sec. 4.2, I use this methodological framework to explain a research cycle that abstracts the multi-phase mixed methods design of the empirical part of this thesis. This design integrates quantitative and qualitative methods within the same methodology and epistemology; it questions whether the distinction makes sense; and introduces another perspective that positions methods on a spectrum between more rule-based approaches, and methods that are more responsive to observations during the research process.

Putting this into practice, chapter 5 is the first empirical study of this project, and focuses on the diffusion of information. It reports on a study of the spread of two hashtags and a Uniform Resource Locator (URL) (i.e., web address) on Twitter following two acute events: a terrorist attack in Sydney (in 2014), and the Brexit referendum (in 2016). This mainly leads to new insights into, and possibilities for assessing and understanding, the virality of content. Furthermore, it proposes ways to further develop and test existing theory and hypotheses related to information diffusion – such as gatekeeping; one-, two-, or multi-step-flows; or opinion leaders – and to couple them with concepts of virality and contagion.

Following the insight that to fully understand the diffusion of information, an understanding of the longer-term, underlying communication structures is needed, chapter 6 addresses concepts such as audiences, publics, issue publics, public sphericules, communities, filter bubbles, and echo chambers from a network science perspective.
To do so, it applies two fundamentally different community detection algorithms to the network of follow connections in the Australian Twittersphere. This is then followed by an automated keyword extraction from the tweets of the accounts analysed. This not only provides new, detailed evidence about macro- and microstructures in the Australian Twittersphere, but also allows, and shows the necessity of, a reflection on the definitions of ‘community’ that underlie network-based community detection algorithms, and on the epistemological implications of choosing one over the other. With a focus on the objectives of this project, insights from this reflection lead

10

CHAPTER 1. INTRODUCTION

to an overarching discussion in chapter 7. This discussion synthesises the findings from the literature review, the methodological framework, and the findings and insights from both empirical studies. I explain how this project builds a strong case for the interpretation of much media and communication theory as theory of network structures and dynamics. Furthermore, I abstract from the methodological framework and the two empirical studies a summary of what is necessary to make further progress in using network science to validate, reject, develop, or extend theories about a networked public sphere. Using the results and theoretical implications drawn from both empirical studies, the last part of the discussion argues that network science is not only a promising and useful discipline, but also a necessary auxiliary discipline for media and communication studies if they are to understand the current media environment and its social effects. In the concluding sec. 7.4, I reflect on the project and discuss the reasons for the discipline-bridging character and relevance of network science for media and communication studies. I also make recommendations for ensuring the further integration of network science into the building of media and communication theory about a networked public sphere. In sum, this thesis makes a strong argument for a more prominent role of network science in media and communication studies, to engage in what I would call ‘communication topology’. It provides new empirical evidence about information diffusion and communication structures on Twitter at a national level. Finally, it explores, tests, and combines methods for procedures that are useful in employing big social data for the building of theory about a networked public sphere.

Chapter 2

Literature Review

The internet, and especially the success of social media, has arguably dominated changes to the structure of the public sphere in the last three decades. However, while the internet and social media were mainly seen as great facilitators during this time, the challenges that these heavily networked media systems bring with them have recently become more visible. Society now faces a complexity in public communication that is difficult to manage. This complexity is not only rooted in the number of voices that can now be heard, but also in the seemingly endless possibilities for broadcasting and reception. To complicate matters, access to these channels, and to the data needed to understand them, is mostly controlled by organisations for which public communication is not only a necessary ingredient for a functioning society; it is also an asset that can be monetised by controlling access to its participants’ attention, and the means to measure their response.

In this literature review, I first illustrate some of these challenges to show the necessity for theories about structures in a networked public sphere. Then, in sec. 2.2, I provide examples of research dealing with social media data that addresses questions relevant to such theory. These examples reveal the usefulness of insights, methods, and vocabulary from network science in answering these questions; however, they also give insights into the shortcomings of the research to date. These shortcomings are based on particular research challenges, some of which I will summarise in sec. 2.3. One of these challenges is a disciplinary divide in the research
cultures of the fields relevant to this kind of study. This divide hinders a critical and theory-based approach to the methods of network science. At the same time, it also slows down evidence- and data-driven approaches to theory about a networked public sphere in media and communication studies. Thus, a further review, with a focus on relevant approaches from both disciplines, is necessary. This section of the literature review is divided into two categories: a theoretical, more humanities-centred body of work; and research in, or closely related to, network science. Even though the classification into theoretical and method-oriented literature is often as much an over-simplification as is classifying the literature into media and communication studies and network science, it is still helpful in providing an orientation for this sprawling interdisciplinary body of work. In the first part (sec. 2.4), in an effort to sketch a brief history of the evolution and perception of public communication during the last 100 years, I review the relevant literature from social science, media and communication, and cultural studies. In the process, I illustrate how fundamental hypotheses and theories of media and communication studies also show the potential to be operationalised as theories of network structures. In the second section (sec. 2.5), I map the field of network science which, in itself, covers at least four disciplinary approaches (Ackland, 2013, p. 13): network science from the field of applied physics; network science from a more sociological perspective; approaches from information and computer science; and attempts by media and communication scholars to introduce this field into their research. Taken together, this review will allow a reader (from any of these disciplines) to follow the motivation for, and the objectives, analysis, and discussion of, this project’s empirical case studies.

2.1

Recent Challenges in Public Communication

The election of Donald Trump as President of the United States, the successful referendum on the United Kingdom leaving the European Union, and the success of hard-right movements and parties in western countries, brought two theoretically predicted phenomena to the fore: a failure of the media system to filter faulty information, leading to a surge in the distribution of so-called ‘fake news’ (which is addressed in sec. 2.1.1); and, related to this phenomenon, the constructs of filter bubbles or echo chambers (the topic of sec. 2.1.2). While these concepts are nothing new to researchers who are investigating the effects of the internet on public communication, their impact on political publics seems to have reached a level that now gives them unprecedented attention. Both phenomena are related to concepts that are equally important to media and communication studies and to network science. They raise questions regarding the dynamics and structures of communication networks, or of the public sphere itself; this, in turn, leads to the necessity of developing an evidence-based theory about a contemporary networked public sphere. This is later proposed in sec. 2.1.3.

2.1.1 Fake News

Fake news is now scarcely a neglected topic; however, it is a good example of the importance of understanding the structure and dynamics of a networked public sphere. Organisations as diverse as the Vatican (Holy See Press Office, 2017) and NATO (Bertolin et al., 2017) are addressing the issue; research that seeks to find frameworks to define, analyse, and counteract the phenomenon is commissioned and funded by the European Research Council (Guess, Nyhan, & Reifler, 2018) and the Council of Europe (Wardle & Derakhshan, 2017); and the Reuters Institute recently published two major fact-sheets regarding audience perspectives on, and the reach of, fake news in the wake of the US elections (Fletcher, Cornia, Graves, & Nielsen, 2018; Nielsen & Graves, 2017). These are but a few examples of the studies related to understanding the phenomenon of fake news. While much of this research deals with the issues of defining, categorising, and counteracting fake news, Bounegru, Gray, Venturini, & Mauri (2018) express the important notion:

that fake news is not just another type of content that circulates online, but that it is precisely the character of [its] online circulation and reception that makes something into fake news. In this sense fake news may be considered not just in terms of the form or content of the message, but also in terms of the mediating infrastructures, platforms and participatory cultures which
facilitate its circulation. (Bounegru et al., 2018, p. 8)

On a more general level, this relates fake news closely to news and information diffusion research; to related theories in media and communication studies such as opinion leadership and gatekeeping; and to network science methods that deal with the analysis of the cascading spread of information, behaviour, and emotion on social media. These concepts are addressed in the first empirical study of this thesis in chapter 5. The complex interaction of long-term communication structures, their ephemeral dynamics, the individual properties of actors, and the specific features of the news shared, however, means that clear reasons for the sudden (and, perhaps, only perceived) importance of fake news are impossible to find. In fact, this complex interaction leads to seemingly contradictory research about the question of whether network structures, the individual properties of audience members, or the features of the news itself are the main ‘culprit’. Vosoughi, Roy, & Aral (2018), for example, in a large scale study of fact-checked news shared on Twitter, found that users were more likely to share news that was later debunked than news that could be verified. Their results show that this could be mainly related to the (generally) higher (faked) novelty of made-up news, and the higher likelihood that such news will provoke shares. At the same time, they found that users “who spread false news had significantly fewer followers” and “followed significantly fewer people”. Therefore, they conclude that structural elements do not explain “why falsity travels with greater velocity than the truth”. However, these structural measures, and their interpretations, might be too simple. For example, the first study for this thesis, in chapter 5, suggests that following more accounts can also reduce the attention available and, thereby, hinder the spread of news. Furthermore, with the lower number of followings and followers, Vosoughi et al. (2018) have in fact found significant structural differences.
Their finding simply contradicts the expectation that more followers and followees improve the spread of information. The importance of network structure is highlighted even more by a number of studies that use simulation models on networks to investigate information diffusion. For example, Qiu, Oliveira, Shirazi, Flammini, & Menczer (2017) present a simplified, though plausible, model for the spread of information of varying quality on a network. They show that, depending on the attention capacities of the actors and the information load (i.e., the frequency of new information entering the network), the network fails to be a functioning marketplace of ideas: information with low quality “is just as likely to go viral” (Qiu et al., 2017) as high quality information. However, their model only explores this effect on an artificial kind of network, a so-called ‘scale-free network’ (see sec. 2.5.2). Therefore, they suggest that “the model could be further expanded to capture other characteristics derived from empirical social networks, such as the segregated communities that we typically observe around discussions of polarizing topics”. This is where the concepts of filter bubbles and echo chambers or, more generally speaking, long-term communication structures, become relevant.
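To make the mechanics of such simulation models concrete, the following is a minimal sketch – my own illustration, not Qiu et al.’s actual implementation – of limited-attention meme diffusion on a scale-free substrate, using networkx; all names and parameter values are assumptions chosen for demonstration:

```python
import random
import networkx as nx

def simulate(n_agents=200, feed_len=5, mu=0.25, steps=20_000, seed=42):
    """Toy limited-attention diffusion: agents either post a new meme
    (probability mu) or reshare one from a short feed, weighted by quality."""
    rng = random.Random(seed)
    g = nx.barabasi_albert_graph(n_agents, 3, seed=seed)  # scale-free substrate
    feeds = {a: [] for a in g}   # each agent only sees its last `feed_len` memes
    memes = {}                   # meme id -> [quality, times shared]
    next_id = 0
    for _ in range(steps):
        agent = rng.randrange(n_agents)
        if rng.random() < mu or not feeds[agent]:
            meme, next_id = next_id, next_id + 1
            memes[meme] = [rng.random(), 0]  # quality drawn uniformly at random
        else:
            weights = [memes[m][0] for m in feeds[agent]]
            meme = rng.choices(feeds[agent], weights=weights)[0]
        memes[meme][1] += 1
        for neighbour in g.neighbors(agent):  # the meme enters neighbours' feeds,
            feeds[neighbour] = ([meme] + feeds[neighbour])[:feed_len]  # displacing old ones
    return memes
```

Shrinking `feed_len` (attention) or raising `mu` (information load) weakens the correlation between a meme’s quality and how often it is shared – the discriminative failure of the marketplace of ideas that Qiu et al. describe.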

2.1.2 Filter Bubbles and Echo Chambers

Online media, and especially social media, lead to complex, long-lasting communication structures which, modelled as a network, often exhibit clusters that are densely connected within themselves but only sparsely connected to one another. This is confirmed to be the case for the Australian Twittersphere in the second empirical study of this project (chapter 6). With regard to fake news and other content, these kinds of structures can have a fundamental impact on the virality and dynamics of message spreading, especially in models of so-called ‘complex contagion’ (see sec. 2.5.4.1). Accordingly, the first empirical study shows differences in the underlying community structures – differences that are in line with qualitative observations of their spreading patterns and the underlying nature of the diffusing message (see sec. 5.4.3.4). This observation of dense clusters in online networks is in line with political communication and political psychology theories and research, where it is seen as a result of attraction to similar others (homophily) and the selective consumption and sharing of news that supports existing views (Boutyline & Willer, 2017, p. 551). While “political theorists argue that dialogue across lines of political difference is a prerequisite for sustaining a democratic citizenry” (Boutyline & Willer, 2017, p. 551), these clusters can also have a positive effect as they “reinforce behavioral norms and increase social pressure”; in this way, they can be beneficial for activities important for
a democracy, such as protests or voter turnout (Boutyline & Willer, 2017, p. 552). With the help of network analysis, Boutyline & Willer (2017, p. 565) confirm that, on Twitter, there is stronger political homophily on the right of the political spectrum than on the left; meanwhile, accounts at the ideological extremes of the spectrum are, in general, more homophilous in their follow-behaviour than those in the centre. The first of these findings is confirmed in the second empirical study of this project in chapter 6, where the hard-right cluster in the Australian Twittersphere appears more segregated from the overall Twittersphere than a cluster of leftist-progressives. While social media could amplify this segregation, the empirical evidence is mixed. Based on the browsing histories of a large sample of US citizens, there is evidence for the widespread belief that, in the case of accessing political news, “articles found via social media or web-search engines are indeed associated with higher ideological segregation than those an individual reads by directly visiting news sites”; at the same time, “these channels are associated with greater exposure to opposing perspectives” (Flaxman, Goel, & Rao, 2016, p. 318). Findings in a European context by Vaccari et al. (2016, p. 8) suggest similarly mixed results:

German and Italian Twitter users who communicate about elections are more likely to do so in networks that support rather than challenge their views, consistent with the notion that social media facilitates the emergence of echo chambers. At the same time, contrarian clubs, which involve frequent encounters with dissonant opinions—whether in oppositional or mixed networks—are less exceptional than expected.

These ambivalent findings on the impact of social media on ideological polarisation are not new. They echo a case study finding by Yardi & boyd (2010) regarding Twitter discussions around the shooting of a late-term US abortion doctor.
The authors “see both homophily and heterogeneity in conversations about abortion. People were more likely to interact with others who share the same views as they do, but they are actively engaged with those with whom they disagree” (Yardi & boyd, 2010, p. 325). Therefore, they conclude that their “results suggest that the wide range of interactions that we observed on Twitter may promote positive social outcomes” (Yardi & boyd, 2010, p. 325). This is despite the ease with which the platform allows people to interact only with like-minded citizens. Most of the recent assessments of the prevalence of filter bubbles and/or echo chambers suffer from unclear and differing definitions of the terms themselves. Bruns (2017b) proposes a definition that not only provides operationalisability in network terms, but also distinguishes the two concepts:

1. An echo chamber comes into being where a group of participants choose to preferentially connect with each other, to the exclusion of outsiders. The more fully formed this network is (that is, the more connections are created within the group, and the more connections with outsiders are severed), the more isolated from the introduction of outside views is the group, while the views of its members are able to circulate widely within it.

2. A filter bubble emerges when a group of participants, independent of the underlying network structures of their connections with others, choose to preferentially communicate with each other, to the exclusion of outsiders. The more consistently they adhere to such practices, the more likely it is that participants’ own views and information will circulate amongst group members, rather than information introduced from the outside. (Bruns, 2017b, p. 3)

This thesis adopts these definitions. In any case, both constructs are intertwined with the phenomenon of fake news: if a ‘false’ belief establishes itself within an echo chamber or filter bubble, it is likely to persist, even in the face of contradictory outside evidence.
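These definitions lend themselves to direct operationalisation: the same closure measure, applied once to a follow network (connections) and once to an @mention network (communication), separates the two constructs. Below is a minimal sketch using Krackhardt and Stern’s E-I index – one established closure measure, chosen here for illustration – on invented toy data; the graphs and the group are purely hypothetical:

```python
import networkx as nx

def ei_index(g, group):
    """Krackhardt & Stern's E-I index for `group`:
    -1.0 = all of the group's ties are internal, +1.0 = all are external."""
    group = set(group)
    internal = external = 0
    for u, v in g.edges():
        if u in group or v in group:
            if u in group and v in group:
                internal += 1
            else:
                external += 1
    total = internal + external
    return (external - internal) / total if total else 0.0

# Invented toy data over the same accounts.
follows = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c"), ("c", "d"), ("d", "c")])
mentions = nx.DiGraph([("a", "b"), ("b", "a")])
group = {"a", "b"}

structural = ei_index(follows, group)      # echo chamber tendency (connections)
communicative = ei_index(mentions, group)  # filter bubble tendency (interactions)
```

In this toy case the group’s @mention ties are fully closed (E-I index of -1.0) while its follow ties are not (-1/3): on the definitions above, filter-bubble-like behaviour without a fully formed echo chamber.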

2.1.3 The Networked Public Sphere

While this thesis does not explicitly focus on investigating the specific phenomena of filter bubbles, echo chambers, and fake news, it explores methods and theory that promise to help our understanding of these and similar constructs that emerged with the advent of the internet and social media. The attention recently given to the danger of filter
bubbles, echo chambers, and fake news underlines the importance of understanding the structures and dynamics of the current heavily networked public sphere. In the case of filter bubbles and echo chambers, the conflicting evidence suggests that we are either not asking the right questions (because of a lack of truly useful theory), or are using the wrong methods to answer them. This gives us cause to take a step back so as to draw a more holistic and more precise picture of the public sphere today. This picture needs to be informed not only by traditional media and communication studies theory, but also by methods that are still new to mass communication theory – especially those from the field of network science. Both fields can be used as sources for the development of the basics of a theory about a networked public sphere, as is done, inter alia, by Bruns (2008), and further developed by Bruns & Highfield (2016). While, at this point, I only outline their basic idea, a more thorough treatment of their work, and of other relevant media and communication theory, follows in sec. 2.4. In short, Bruns & Highfield (2016) propose a more nuanced picture of the structure of the public sphere: it is neither simply a collection of bubbles in an isolated void, nor a mesh where everyone is connected to everything. To them, it “seems obvious that the central feature of such a new model must be the fragmentation of the unified public sphere into a range of diverging yet potentially overlapping publics” (Bruns & Highfield, 2016, p. 5). They agree that these “lower-order publics are likely to be increasingly more reliant on specialist and niche media, in keeping with their own much more narrowly defined interests” (Bruns & Highfield, 2016, p. 7).
But while the ‘filter bubble’ metaphor suggests that such bubbles are each hermetically sealed from one another, observable reality appears to point to a much greater degree of interpenetration through shared connections and information flows … (Bruns & Highfield, 2016, p. 9)

To provide evidence for their model of interconnected domain publics, technologically driven public spheres, public sphericules, and issue publics, they argue that social media communication structures can be seen as (possibly distorted, but still valid) reflections of public sphere constructs (Bruns & Highfield, 2016, p. 10). This allows them to use their dataset of follow connections in the Australian Twittersphere to illustrate, with network visualisations, the existence of densely connected clusters of accounts that are associated with certain topics. However, they can also show that there are issues that connect larger areas of the overall Twittersphere (Bruns & Highfield, 2016, pp. 16–17). This observation, together with their model of overlapping lower-level publics within a public sphere, “suggests that fragmentation does not necessarily beget isolation or complete separation” (Bruns & Highfield, 2016, p. 17). To investigate this suggestion more precisely, we need a more gradual, multidimensional theory of different forms of publics, based on the structures and dynamics of the overall public sphere. While making further steps in this direction, this thesis argues that a great deal of the vocabulary necessary to formulate and empirically test such theory can be found in network science. Meanwhile, when dealing with social media analytics, network science should pay more attention to media and communication theorists.

2.2 Social Media as a Source of Evidence

Bruns & Highfield (2016) are not the only ones to see social media data as a rich source of evidence for structures and dynamics in public communication. Many others also see it (at least) as an excellent playing field for researchers to investigate ‘online social interactions’. Facebook and Twitter are two of the most used social media platforms today. Despite having fewer active users, Twitter appears to be the object of more academic research. This can be explained by greater data availability and fewer ethical concerns: Twitter accounts are either public (by default), or completely private (as actively chosen by the user). Hence, it is to be expected that Twitter users are more aware of the public availability of their data. However, this still does not justify the use of their data in every conceivable way, as discussed in sec. 4.2.6. While ethical concerns can be addressed by researchers themselves, data availability lies in the hands of the platforms, which see their data as a form of capital. As a consequence, data availability has become more restricted over time. This situation worsened for academics when Facebook further tightened its API access because of its abuse by the political consulting firm Cambridge Analytica (Bruns, 2018a). This is especially painful for researchers, as network completeness is particularly important in the study of online social
networks because unlike traditional social science research, the members of online social networks are not controlled random samples, and instead should be considered biased samples. (Ugander, Karrer, Backstrom, & Marlow, 2011, p. 2)

Nevertheless, due to the platforms’ generosity, to funding, or to an inventive approach to data collection, researchers have still managed to produce an impressive body of work. As a comprehensive account of this large body of work is not possible, I present examples of research on the Facebook and Twitter platforms only. Furthermore, I only present material that is specifically connected to either network science or media studies or, preferably, to both.

2.2.1 Research on Twitter

To my knowledge, researchers examining the hyperlink networks of the internet did not have to justify the relevance of their research. In contrast, Twitter, as a research object, underwent three phases of debanalisation (Rogers, 2013):

• First, researchers tried to find evidence for its banality;
• then, they discovered it as a source of collections of newsworthy information;
• and, finally, it became a collection of historical data that was even subject to an (abandoned) attempt by the Library of Congress (USA) to archive it.

In line with a long tradition of content analysis studies of classical media, much research in media and communication studies focuses on what is, and has been, tweeted by Twitter users, and on what we can learn from this in the form of a collection of text. However, instead of a content analysis of these historical data, I focus (in this review, and in the thesis as a whole) on the investigation of structures, following Wasserman & Faust (1994) and “refer[ring] to the presence of regular network patterns in relationship as structure” (Wasserman & Faust, 1994, p. 3). A very intuitive way to look at structures is to visualise them. For example, Bruns (2011b) developed a workflow using text processing scripts written with gawk, and the network visualisation tool Gephi (Bastian, Heymann, & Jacomy, 2009), to visualise @mention and hashtag networks. Influenced by this workflow, several studies of national Twitterspheres emerged; e.g., @mention networks of the Austrian Twittersphere
by Ausserhofer & Maireder (2013), or the Dutch equivalent by Geenen, Boeschoten, Hekman, Bakker, & Moons (2016).

Figure 2.1: 2016 Australian Twittersphere. Network of nodes with degree greater than 1000. Clusters labelled following qualitative review of leading accounts in each community, detected by modularity-based detection algorithm (source: Bruns et al. (2017))

Extending the mapping approach, Bruns et al. (2014) visualised the follower network – rather than conversational @mention networks – of Twitter accounts that were identified as Australian and had a combined in- and out-degree of at least 1000. Using a force-directed layout – a visualisation technique that places nodes closer together the more neighbours they share – it is possible to identify densely connected groups of nodes that can be interpreted as clusters of common interests. Using a community detection algorithm based on the maximisation of the link density within given groups, compared to the expected density in a random network (modularity) – i.e., following a density-based definition of ‘community’ (see sec. 2.5.3) – Bruns et al. (2014) associated these clusters with topics (e.g., politics, economy, sports), as determined from the profile descriptions of the most popular accounts in each cluster. With this visualisation, and data about the spread of links or other entities (such as hashtags or media objects), it is possible to get a feeling for the reach of certain topics
throughout the Australian Twittersphere. A follow-up study, with similar methods but more recent data (from 2016), by Bruns et al. (2017) shows that the clusters changed in size, but that the overall structure and its representation remained stable. Prima facie, this stability, and plausible observations – such as the distinctive outside-position of the ‘Hard Right’ clusters next to the ‘Politics’ cluster, or the existence of subclusters of ‘Cycling’ and ‘Horse Racing’ in ‘Sports’ (see fig. 2.1) – validate this mixed-methods approach. We return to this approach in the second study in chapter 6. Using visualisation on a smaller scale, but with multiple, diverse datasets, Smith, Rainie, Shneiderman, & Himelboim (2014) found six characteristic macroscopic network structures of @mentions and retweets within keyword and hashtag-based datasets from Twitter that arise in certain communication contexts. In controversial discussions, for example, one can often find polarised crowds, reminiscent of the filter bubble or echo chamber concepts: that is, two clusters that are densely connected within themselves, but only sparsely connected to each other. Around a news media outlet or a journalist’s account, one can usually find broadcast networks: the broadcaster is mentioned or retweeted, with only a few interactions with and within the audience, while several clusters around it communicate among themselves (see fig. 2.2). However, visualisation alone cannot provide researchers with workable quantitative criteria to distinguish communication patterns. To look at a network of the size to which Twitter communication networks often grow, we need an approach that turns visual heuristics into quantifiable and comparable measures that allow automated classification at scale. Himelboim, Smith, Rainie, Shneiderman, & Espina (2017) propose a quantitative workflow to classify communication networks into the six structures found by Smith et al. (2014).
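The density-based clustering behind these Twittersphere maps can be reproduced at small scale. The sketch below – my own illustration using networkx’s greedy modularity maximisation, not the tooling of the studies above, which relied on Gephi – applies the same logic to a synthetic graph with two planted groups:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Synthetic stand-in for a follow network: two dense groups, few bridging ties.
g = nx.planted_partition_graph(2, 20, p_in=0.5, p_out=0.02, seed=1)

# Greedy modularity maximisation merges groups as long as the within-group
# link density exceeds what a degree-preserving random network would yield.
communities = greedy_modularity_communities(g)
labels = {node: i for i, com in enumerate(communities) for node in com}
quality = modularity(g, communities)
```

In the studies above, the detected clusters were then labelled qualitatively via the profile descriptions of each cluster’s most popular accounts; here, `labels` is simply a node-to-community mapping and `quality` the achieved modularity score.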
It starts by measuring the centralisation1. It then classifies the networks further by the direction of this centralisation2; the density of the overall network; its modularity; and the fraction of isolated communities and/or nodes. (For the whole workflow, see fig. 2.3.)

1 i.e., how central the most central nodes of the network are compared to the rest
2 i.e., whether the most central nodes are central due to incoming or outgoing communication
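The branching logic of this workflow can be sketched as a decision function. The thresholds below are invented for illustration (Himelboim et al. calibrate theirs against their six archetypes), the check order is simplified, and the function assumes a directed @mention or retweet graph:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

def classify(g, hi_central=0.6, hi_density=0.1, hi_isolated=0.5, hi_modularity=0.4):
    """Rough sketch of Himelboim et al.'s (2017) classification; `g` is directed."""
    n = g.number_of_nodes()
    und = g.to_undirected()
    in_c = max(nx.in_degree_centrality(g).values())
    out_c = max(nx.out_degree_centrality(g).values())
    if max(in_c, out_c) > hi_central:                       # highly centralised;
        return "broadcast" if in_c >= out_c else "support"  # split by direction
    if nx.density(und) > hi_density:
        return "tight crowd"
    comps = list(nx.connected_components(und))
    if sum(len(c) for c in comps if len(c) <= 2) / n > hi_isolated:
        return "brand clusters"                             # mostly isolates
    communities = greedy_modularity_communities(und)
    if modularity(und, communities) > hi_modularity and len(communities) == 2:
        return "polarized crowds"
    return "community clusters"
```

For instance, a star of accounts all retweeting one hub account would fall into the highly centralised, incoming-ties branch and be classified as a broadcast network.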


Figure 2.2: Overview of six characteristic macroscopic communication network structures in Twitter communication networks found by M.A. Smith, Rainie, Shneiderman, & Himelboim (2014) (used with permission, source: http://www.pewinternet.org/2014/02/20/mapping-twitter-topic-networks-from-polarized-crowds-to-community-clusters/figure-3/)

Each of these structures has its own implications for the flow of information and characteristic dynamics.

For conferences, for example, a unified tight crowd may serve its purpose, connecting convention participants or members. For grassroots movements …, in contrast, forming an in-group network may suggest a failure to grow and diversify beyond their core group. (Himelboim et al., 2017, p. 10)

Figure 2.3: Flow diagram of the quantitative classification of the six macroscopic communication network structures. CC BY NC (https://creativecommons.org/licenses/by-nc/3.0/) Himelboim, Smith, Rainie, Shneiderman, & Espina (2017). © 2017 by SAGE Publications Ltd.

As I note again in the discussion of the second study in chapter 6, such classifications might be a more helpful way to categorise publics than simply considering more-or-less closed chambers and bubbles. Another interesting example of quantifying qualitative properties of Twitter communication on a large scale was pursued by Sousa, Sarmento, & Rodrigues (2010), who examined the @mention network of about 50 000 Twitter accounts tweeting in Portuguese over two months, to answer the question of whether Twitter is shaped more
by social or topical motivations. They applied a measure used in information theory and statistics3 to provide insights into the question of whether the @replies of those accounts were motivated more by topics, or by the history of those to whom they had previously replied. At least for this dataset, they conclude that the @reply network on Twitter is a social rather than a topical network. Distinctions like this might be helpful in distinguishing other kinds of publics, issue publics, and communities (see sec. 2.4.2.2). Both are concepts that help in the understanding of the results of this project’s first study regarding hashtag and link diffusion in chapter 5. Myers, Sharma, Gupta, & Lin (2014) took the same question – that is, whether a network is a social or an information network – to a new level by looking at the entire Twitter follower network. Their definitions of the difference between the two kinds of networks relied completely on quantitative network measures. According to them, an information network exhibits large vertex degrees4; a lack of reciprocity5; and large two-hop neighbourhoods6. On the other hand, they contend, a social network exhibits “high degree assortativity7, small shortest path lengths8, large connected components9, high clustering coefficients10, and a high degree of reciprocity11” (Myers et al., 2014, p. 493). Examining these measures, Myers et al. (2014) obtained contradictory results. They provide evidence that the use of Twitter changes for most accounts over time, as the purpose of its network usage changes from a purely information-gathering purpose to a more social purpose. Therefore, they conclude that Twitter is a hybrid network. This makes the platform a suitable object of study for examining methods that address both types of networks. The fact that all authors of this study worked for Twitter at the time of its publication, however, points to an issue with Twitter data: accessibility.
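The structural measures Myers et al. list map directly onto standard graph-library functions. A small sketch on an invented toy follow graph – both the graph and the cut-off values are purely illustrative, not Myers et al.’s thresholds:

```python
import networkx as nx

# Invented toy follow graph: a, b, c all follow each other; d follows a and b.
g = nx.DiGraph([("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
                ("a", "c"), ("c", "a"), ("d", "a"), ("d", "b")])

reciprocity = nx.overall_reciprocity(g)                # share of mutual follows
clustering = nx.average_clustering(g.to_undirected())  # local closure
assortativity = nx.degree_assortativity_coefficient(g.to_undirected())
giant = max(nx.connected_components(g.to_undirected()), key=len)

# Myers et al.'s reading, grossly simplified: high reciprocity and clustering
# suggest a social network; their absence suggests an information network.
looks_social = reciprocity > 0.5 and clustering > 0.3
```

On real data these measures are computed over full platform crawls; the toy graph merely shows which library calls correspond to which of the quoted properties.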
Driscoll & Walker (2014) compared the Streaming API to Gnip’s Firehose, and saw that the Streaming API exhibits large data losses in times of high activity. This is

3. the Normalized Pointwise Mutual Information of the ego networks of users tweeting about different topics
4. i.e., many connections to and from a single account/node
5. i.e., if an account follows another account, it is unlikely that the other account follows back
6. i.e., accounts have a lot of followers of their followers
7. i.e., preference of nodes/vertices to connect to nodes with a similar degree
8. i.e., the smallest number of hops to reach one node from another node on the follower network
9. where a connected component is defined as a subset of the network within which every node is reachable via the links between the nodes
10. i.e., the nodes are densely connected
11. i.e., if an account follows another account, it is likely that the other account follows back
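The measure referred to in footnote 3 can be stated compactly. Below is a minimal, illustrative implementation of Normalized Pointwise Mutual Information for two events; the function name and probability inputs are my own simplification, not the cited study’s exact implementation (which applies the measure to users’ ego networks):

```python
from math import log

def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized Pointwise Mutual Information.

    Ranges from -1 (the events never co-occur) through 0 (independence)
    to +1 (perfect co-occurrence).
    """
    if p_xy == 0:
        return -1.0
    pmi = log(p_xy / (p_x * p_y))
    return pmi / -log(p_xy)

# Independent events -> 0; perfectly co-occurring events -> 1.
print(npmi(0.25, 0.5, 0.5), npmi(0.5, 0.5, 0.5))
```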


CHAPTER 2. LITERATURE REVIEW

a concerning fact, especially given that a lot of research does not state which specific API or source was used to collect its data. Unfortunately, even with the use of the commercially offered Firehose, data integrity seems to remain a matter of trust.

With the exception of Bruns et al.’s studies mentioned in this section, the abovementioned studies do not directly connect to established media and communication theories, as this thesis does. Nonetheless, there are researchers who are working to integrate their findings into media scholars’ discourse about structures in the public sphere, as presented later in sec. 2.4.2. This is especially the case for the idea of a two-step flow, or the assumed existence of opinion leaders. Both concepts go back to Lazarsfeld et al. (1944), and are still relevant to diffusion events such as those investigated in chapter 5.

When trying to answer the question “Who says what to whom on Twitter?”, Wu, Hofman, Mason, & Watts (2011) examined tweets from the Twitter Firehose12 containing links shortened by the link shortener bit.ly. Using lists to categorise elite users (e.g., journalists or celebrities) and ordinary users, they found evidence for a two-step flow of communication (see sec. 2.4.2.4) on Twitter. Furthermore, they found that different content attracted different categories of users. The accounts in the elite categories also seemed to be highly homophilous: that is, most attention was spent on content from their own category. Both findings are interesting within the filter bubble and echo chamber discussion.

Cha, Benevenuto, Haddadi, & Gummadi (2012) tried to find a way to categorise users “that is also meaningful in the context of existing theory” (p. 992). However, instead of first comparing and evaluating theories, they chose a data-driven approach. Based on qualitative changes in the slope of the cumulative degree distribution function, they found a generic division of users into three groups.
They identify these as mass media sources; evangelists (corresponding to opinion leaders in a two-step flow framework); and grassroots (i.e., ordinary users). They found that the mass media have a dominant role on Twitter, and can often “reach a large portion of the audience directly” (Cha et al., 2012, p. 997). Nevertheless, occasionally, evangelists and grassroots accounts broke news before the mass media, and evangelists “played a leading role in the spread of news in terms of the contribution of the number of messages and

12. all tweets posted in a specified time window


in bridging grassroots who otherwise are not connected” (Cha et al., 2012, p. 997).

Earlier, Cha, Haddadi, Benevenuto, & Gummadi (2010), measuring influence by the number of followers, the probability of being mentioned, and the probability of spawning retweets, found that these three forms of influence are not necessarily congruent. They called this phenomenon the “Million Follower Fallacy”. While the follower number indicates overall popularity, they claim that mentions represent the name value of an account; i.e., the visibility and influence that comes from the identity of the account itself, not from its content or its followers. Nevertheless, they found a “highly skewed ability of users to influence others” that follows a power law; a strong correlation of influence ranks across topics; and a certain stability of influence over time. They see this as evidence for the existence of opinion leaders on Twitter. However, one can criticise the fact that they (admittedly) do not normalise spawned retweets and mentions by the total number of tweets: “When [they] tried normalizing the data, [they] identified local opinion leaders as the most influential. However, normalization failed to rank users with the highest sheer number of retweets as influential” (Cha et al., 2010, p. 13).

Finally, An, Cha, Gummadi, & Crowcroft (2011) not only found evidence for the importance of a two-step flow on Twitter, but also provided data to question the filter-bubble and echo-chamber hypotheses regarding social media. While users seem to tend to focus on single topics in their media consumption, An et al. (2011) “also observe that certain media sources, especially journalists, excel in connecting media from different topics, indicating that Twitter users who follow journalists tend to seek more diverse types of information” (p. 18). At the same time, “media organizations reach a considerably larger audience through indirect exposure via social links” (p. 19).
This exposure also leads to “users receiv[ing] information from six to ten times more media source[s] than from direct exposure alone” (p. 19). It also crosses political boundaries, as “between 60-98% of the users who directly followed media sources with only a single political leaning (left, right, or center) are indirectly exposed to media sources with a different political leaning” (p. 19).

Taken as a whole, this body of research on communication structures on Twitter shows that Twitter as a platform is a rich repository for the observation of public communication patterns. While these patterns are constrained and shaped by the affordances of the platform, they are still general enough for some of the studies to be connected with essential theories from media and communication studies. Often, however, this engagement with theory merely scratches the surface, as does the employment of more sophisticated network science methods. Furthermore, due to a lack of access to data, of its availability, or of the means to process it, most studies do not deal with networks of a size that reaches the scale of whole societies. This thesis makes a contribution in all three of these directions.
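Several of the findings above – notably Cha et al.’s (2010) ‘Million Follower Fallacy’, i.e., that follower counts, mentions, and retweets need not rank accounts the same way – amount to statements about rank correlation. A minimal sketch with invented follower and retweet numbers (not Cha et al.’s data), using a plain implementation of Spearman’s rank correlation:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free value lists (textbook formula)."""
    n = len(xs)
    rank = lambda vs: {v: i + 1 for i, v in enumerate(sorted(vs))}
    rx, ry = rank(xs), rank(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical accounts: follower counts vs. retweets spawned (invented numbers).
followers = [5_000_000, 900_000, 50_000, 8_000, 1_200]
retweets = [300, 4_000, 2_500, 3_800, 100]

print(round(spearman_rho(followers, retweets), 2))  # 0.3: popularity only loosely tracks influence
```

A rank correlation well below 1 is exactly the pattern behind the Million Follower Fallacy: the account with the most followers need not be the one whose tweets spread furthest.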

2.2.2 Research on Facebook

While Facebook is globally the most used social media platform, research related to it is harder to find than research related to Twitter. This thesis, too, focuses on Twitter data: not only can Facebook communication only partly be considered public, but this fact also leads to problems of data access and ethics. Nevertheless, Facebook can also be used to investigate public communication with a focus on network structures.

Using Facebook data, a more recent study by Del Vicario, Zollo, Caldarelli, Scala, & Quattrociocchi (2017) sheds light on polarisation around news media Facebook pages during the Brexit debate. Their study is especially interesting as it makes use of community detection algorithms, which are also employed in the second study of this thesis in chapter 6 to detect polarised clusters. More specifically, they used a graph of pages that reported on Brexit, in which a link was created between two pages whenever a user had liked a post on both pages. In this graph, they identified a community structure consisting of two to three communities, depending on the community detection algorithm used. (While Del Vicario et al. (2017) merely mention these differences in their findings, the differences highlight the importance of choosing the ‘right’ algorithm for the right purpose or, at least, of understanding and communicating the implications of this choice, as outlined in sec. 2.5.3.)

Del Vicario et al. (2017) identify a strong polarisation regarding the probability of leaving comments or likes on either of the communities detected by one of the density-based algorithms. Likes were interpreted as a signal of positive sentiment about the content. With the help of sentiment analysis, using the commercially marketed natural


language analysis features of the IBM Watson API, Del Vicario et al. (2017) showed that posts in both communities often have opposing sentiments. However, their usage of the term ‘echo chamber’ can be criticised, because their results also suggest that comments left on these posts often oppose the sentiments of the posts themselves. As the communities are based on likes only, and not on comments, an analysis of a comment-based network would be needed to confirm that users engage primarily with only one community of pages. The fact that there are no available data regarding exposure to content without user interaction remains another problem. Nevertheless, their methodology is worth pursuing further, especially because the data used are publicly available. At the same time, their failure to provide a rationale for choosing a certain community detection algorithm over others gives cause for an investigation of the implications of choosing one algorithm over another. An example of this kind of investigation is undertaken in chapter 6.

In general, access to Facebook data is even more limited than access to Twitter data. Thus, research usually either focuses on small subpopulations, or tries to sample data via crawling or other means. However, the information gained is never complete; for example, information about whether the examined accounts are active or stale (Ugander et al., 2011). To my knowledge, the only published study of the global structure of the Facebook graph was done by Ugander et al. (2011), who were working for Facebook at the time and, therefore, had access to this huge amount of data and the means to process it. Despite its age and the potential bias in result selection, the study is still interesting: at the time of the study (May 2011), Facebook already had 721 million active13 users (Ugander et al., 2011, p. 2), or around 10% of the world’s population.
These users included half of the population who were able or allowed to access Facebook in the USA (Ugander et al., 2011, p. 3). The 2011 Facebook graph shows some features that are reminiscent of a small-world network, e.g., of the Watts–Strogatz model (Watts & Strogatz, 1998) (see sec. 2.5.2). While the network is sparse, “the average distance between pairs of users was 4.7 for Facebook users and 4.3 for U.S. users” (Ugander et al., 2011, p. 5). This can be explained by the existence of very dense neighbourhoods of nodes. Almost all users (99.91%) were located in one component of the network in which every node can be reached

13. logged in within 28 days


from every other node (a connected component) (Ugander et al., 2011, p. 5), making it theoretically possible for a piece of information to traverse almost the entire global network without leaving Facebook.

Of interest from a social science point of view – especially with respect to a focus on the structure of publics and audiences – is the fact that there is a strong homophily with respect to age, which weakens with increasing age. This contrasts with a minimal homophily regarding gender (Ugander et al., 2011, pp. 10–11). 84.2 percent of edges were within countries (Ugander et al., 2011, p. 12). Meanwhile, edges between countries show a clear group structure; “more curious groupings, not clearly based on geography, include the combination of the United Kingdom, Ghana, and South Africa, which may reflect strong historical ties” (Ugander et al., 2011, p. 13). These results are remarkable, as they show that even this level of overview of the global Facebook network reveals structures relevant not only to media and communication scholars, but also to social and political scientists. They also motivate the second empirical study of this project in chapter 6, which results in a detailed overview of the Australian Twittersphere.
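The graph construction used by Del Vicario et al. (2017) – pages linked when a user has liked posts on both – can be sketched as a one-mode projection of a user–page ‘like’ relation, followed by community detection. The page names and likes below are invented, and networkx’s greedy modularity and label propagation algorithms merely stand in for the algorithms they compared:

```python
import networkx as nx
from itertools import combinations
from networkx.algorithms.community import (
    greedy_modularity_communities,
    label_propagation_communities,
)

# Hypothetical likes: user -> set of news pages on which they liked a post.
likes = {
    "u1": {"LeaveNews", "BrexitDaily"},
    "u2": {"LeaveNews", "BrexitDaily", "SovereignPost"},
    "u3": {"BrexitDaily", "SovereignPost"},
    "u4": {"RemainHerald", "EUObserver"},
    "u5": {"RemainHerald", "EUObserver", "ProEuroTimes"},
    "u6": {"EUObserver", "ProEuroTimes"},
}

# Project to a page-page graph: an edge whenever some user liked both pages.
G = nx.Graph()
for pages in likes.values():
    G.add_edges_from(combinations(sorted(pages), 2))

# Two community detection algorithms may partition the same graph differently.
modularity_comms = [set(c) for c in greedy_modularity_communities(G)]
labelprop_comms = [set(c) for c in label_propagation_communities(G)]
print(len(modularity_comms), len(labelprop_comms))
```

That the two algorithms agree on this toy graph is an artefact of its two cleanly separated clusters; on real, denser graphs such as Del Vicario et al.’s, different algorithms can yield different numbers of communities, which is exactly the methodological choice discussed above.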

2.2.3 Summary

“Historically, studies of social networks were limited to hundreds of individuals as data on social relationships was collected through painstakingly difficult means. Online social networks allow us to increase the scale and accuracy of such studies dramatically” (Ugander et al., 2011, p. 1). These platforms produce qualitative data that are precise and structured enough for quantitative analysis. Due to their nature as concurrent manifestations of social interaction and as media and communication channels, media and communication scholars should be at the forefront of their study.

Admittedly, a comprehensive overview of all research dealing with social media data with an interest in public communication is not possible. However, the reviewed social media research dealing with structures and dynamics in the public sphere, while promising and interesting, exhibits a kind of echo chamber problem of its own. This research is scattered among disciplines, and no major goals or dominating discourses are discernible, despite the fact that the various disciplines are dealing with the same,

highly relevant topic. This, in turn, leads to a lack of common theories and established methods across the disciplines. In the next section, I outline literature that sheds some light on the reasons for this problem.

2.3 Research Challenges for Media and Communication Studies

From a topical point of view, media and communication studies should be the field in which all of this research dealing with communication structures and dynamics is consolidated, summarised, synthesised, and interpreted. However, there are challenges that hinder the uptake of relevant research from the various disciplines that have traditionally not collaborated closely with media and communication studies, such as computer science, mathematics, and physics. These challenges have their roots in the organised complexity of a networked public sphere, and are worsened by the technical difficulties of handling the amount of data available to investigate these ‘problems of organised complexity’ (as explained below). Overcoming these challenges is further impeded by the disciplinary divides among the researchers involved.

As discussed in more detail below – for example, in sec. 2.5.1 – the analysis of structures and dynamics within the networked public sphere can be seen as an analysis of ‘problems of organised complexity’, “where the identity of the elements involved in a system, and their patterns of interactions, [can] no longer be ignored” (Hidalgo, 2016, p. 2). While there are also problems of organised complexity in the natural sciences (such as physics), systems in the natural sciences are mostly “systems that have no agency — that is, no actors capable of processing their own information about the world they inhabit and that are free to react in accordance” (González-Bailón, 2013, p. 154). Complex systems in the social sciences are made even more complex by the agency of their constituents. Additionally, the subjective ambiguity of the meaning of terms and the interpretation of social constructs, which are carried into the seemingly objective measures and metrics produced by the methods coming from the natural and computational sciences, complicate matters even further (see also Baym, 2013, sec. 5.3).
In light of the above, a researcher from media and communication studies faces high costs and a great deal of effort if they wish to integrate methods and findings


from these disciplines. Furthermore, these costs and this effort are not always justified by a demand beyond academia. If we take audience metrics as an example of these challenges, media scholars who try to find measures that truly capture the complexity of today’s media environment face resistance, because the skills that media production professionals (e.g., journalists) need in order to understand these metrics are significant.

Instead, what has emerged to date often more closely resembles a kind of newsroom cargo cult, in which upward trends in basic audience metrics are celebrated and pursued as ends in themselves, rather than critically interrogated as indicators of broader patterns in audience engagement. (Bruns, 2017b)

If we consider that ‘audience’ “is a discursive construct, articulated in different ways to serve differing purposes” (Baym, 2013), it is clear that these ‘basic audience metrics’ – such as simple numbers of shares, followers, or views – cannot capture the whole picture. (This is further detailed in sec. 2.4.2.2.) It becomes all the more clear with the availability of social media data and methods from which we can draw a more detailed picture; adapt it to whatever we conceptualise as ‘audiences’ or ‘publics’; and “advance well beyond conventional television ratings and print circulation figures, measuring not the broad distribution of journalistic content but the specific uses made of it, and responses to it” (Bruns, 2016a, p. 3). Thereby, the concept of an ‘audience’ exemplifies how problems of organised complexity in a networked public sphere are complicated further by the inherent and necessary fuzziness and subjectivity of definitions in the social sciences.
In addition to this complexity, which arises from the social nature of problems in communication, the methods for handling the amount of data available are so new to the media studies fields to which they are relevant “that many journalism researchers (as well as researchers in the related fields of media, communication, and internet studies) lack the methodological training and research expertise to use big data effectively or even correctly” (Bruns, 2016a, p. 15). While working in teams with colleagues from other disciplines can help, “it remains incumbent on journalism [and media] scholars to develop their own methodological skills both in order to collaborate

more effectively with these colleagues, and to ensure that their research strategies are appropriate to the project at hand” (Bruns, 2016a, p. 15). Therefore, along with technical challenges – such as choosing an appropriate software architecture and storage technology (Stieglitz, Mirbabaie, Ross, & Neuberger, 2018, p. 163), or obtaining high quality data (Stieglitz et al., 2018, p. 164) – there comes an educational challenge. This is because:

social scientists do not have the methods at their disposal necessary to discover, collect and prepare relevant big social media data. On the other hand, many of the researchers who are currently applying computational approaches could benefit from a more solid grounding of their approaches in existing social theory. (Stieglitz et al., 2018, p. 161)

For example, the meaningful visualisation of such big social media data, which is crucial from the earliest stages of data collection and selection onwards (Stieglitz et al., 2018, p. 164), necessitates both perspectives – the technical and the theoretical – in an intimately integrated form. This integration of theory and methods is further hindered by a “gulf between the social and the computational sciences” (Stieglitz et al., 2018, p. 161). This gulf is caused by the fact that each of the disciplines involved not only “has its own tradition and merits, but also its own prejudices” (Stieglitz et al., 2018, p. 161). In the extreme, these prejudices lead, inter alia, to sweeping conclusions, such as the prophecy of an ‘end of theory’ (Anderson, 2008). This prophecy assumes that, given enough data to feed machine learning algorithms for prediction, models will no longer be necessary.
Unfortunately, prophecies such as these deepen the gulf even further, provoking social scientists for whom it seems obvious that “theory and interpretation are more necessary than ever before if we are to find the appropriate layer of information in what otherwise is an unnavigable sea” (González-Bailón, 2013, p. 147). The result is all too often a perceived lack of respect for the achievements of both sides and, therefore, a blind eye turned to the chance of merging them:

There are, very obviously, great opportunities in using big data to further this field of research, but these opportunities will not be able to be fully realized without substantial further methodological and conceptual development.… [We] must work furiously to develop, test, and document our transdisciplinary skills, methods, approaches, and frameworks for the use of big data in Journalism [and Media] Studies, and engage in a frank and open debate about the limits of such approaches – not in order to dismiss them altogether and defend established journalism [and media] research practices from this new disruption, but to determine where they may make a useful contribution to the existing methodological toolkit. (Bruns, 2016a, pp. 15–16)

This research project does so by giving equal weight to both – media and communication studies theory and approaches from the natural, mathematical, and computational sciences – in addressing two central problems of communication in a networked public sphere: the diffusion of information; and the emergence of communities in the form of micro-, meso-, and macro-publics. The former problem is addressed in chapter 5, which draws a differentiated picture of virality and contagion informed by both sides of the disciplinary spectrum. The latter is addressed in chapter 6, which questions the epistemological implications of community detection algorithms for media and communication theory related to a networked public sphere.

Addressing these two problems from both perspectives “requires working on the grounds of a common language … The mathematical language of networks opens one such point of contact, as does coding and programming” (González-Bailón, 2013, p. 158). Both network science and programming are the tools of choice for this project. However, if the focus is on tools alone, it is easy to forget that this common language also needs a common understanding of what it is good for, and what it can achieve. As I argue in chapter 4.1, the so-called ‘computational turn’ in the humanities led to a need for an epistemological framework that can integrate rather positivist ideas from the natural and computational sciences with rather constructivist positions from the social sciences and humanities.

To assist readers from both fields, however, with a focus on the (networked)


public sphere and mass communication, we first need to recap the relevant theories, models, and methods from media and communication studies in sec. 2.4, and also from network science in sec. 2.5.

2.4 The Media and Communication Studies Perspective

The field of media and communication studies is, per se, an interdisciplinary subject; therefore, its edges are blurry. Nevertheless, in this section, I summarise research and theory from a tradition shaped and influenced by theorists who hone their models of, and hypotheses about, public communication in a scholarly discourse, and by methods that rely mostly on survey and interview data, analysed either qualitatively or with the toolbox of descriptive and inferential statistics. While sec. 2.4.1 walks through the transforming models relevant to a networked public sphere that emerged during the last 100 years, sec. 2.4.2 illustrates that many established theories and hypotheses are easily interpreted as assumptions about network structures. Both are, therefore, promising fields to examine with network science methods.

2.4.1 The Rise of Networked Public Spheres

In early mass media communication, one of its most pivotal inventions, the radio, found some critics, despite the euphoria surrounding its invention. Its inventors had not intended to develop a one-way broadcasting device; rather, they thought they had invented wireless telephones (Flusser, 2011, p. 284). In line with this perception, Bertolt Brecht‘s so-called “Radiotheorie” (radio theory) of the 1920s claimed the necessity of a backchannel for radio listeners in order to create – in his view – a truly useful medium (Brecht, 1932). However, as things turned out, until the advent of television, the radio became a pinnacle of mass media and propaganda. It created the virtual stage that Habermas described with his highly influential model of a transforming public sphere (Habermas, 1962): a mediated discourse sent one-way from elites to a more or less non-responsive audience. While the audience can have an indirect effect on the content that is distributed by broadcast mass media, ordinary members of the audience have no possibility of talking back directly to the TV programmers. Therefore, these one-to-many mass media can help to establish a form of cultural citizenship: they “can


stimulate the desire for freedom, comfort, politics, culture” (Hartley, 1999, p. 188), but can hardly fulfil it.

Already in the 1970s, Flusser (2011, p. 44) saw mass media channels as speakers in virtual amphitheatres, uni-directionally sending unalterable content into a network of group dialogues and, thereby, synchronising the content of the latter. Flusser’s drawings of different communication structures, which he considered to be basic building blocks of communication in public and in private, are strikingly reminiscent of network visualisations. These communication structures were a major inspiration for many of the ideas in this thesis, especially in the later discussion of the results of the study of publics and communities in chapter 6. Flusser conceived the synchronising and unifying force of the mass media to be so powerful that it suppressed original and new content in the traditional birthplaces of new ideas, for example, in family dialogue. Therefore, the transformation of a mass broadcast medium into a dialogical medium such as the social web (impossible in the 70s) was, for him, the last possible revolutionary act of a contemporary consumerist society (Flusser, 2011, p. 271)14.

The seemingly revolutionary happened: “With the advent of the Net, civic interaction [took] a major historical step by going online, and the sprawling character of the public sphere [became] all the more accentuated” (Dahlgren, 2005, p. 149). Already, before social media emerged as a mass phenomenon, it was evident that

[t]he Internet is at the forefront of the evolving public sphere, and if the dispersion of public spheres generally is contributing to the already destabilized political communication system, specific counter public spheres on the Internet are also allowing engaged citizens to play a role in the development of new democratic politics. (Dahlgren, 2005, p.
160)

The expectation was that the internet would create a media environment that was more participatory and less controlled. At the time of their emergence, however, both radio and television had also raised hopes of a more participatory media environment, as they had the inherent effect of empowering the masses through a previously

14. Translated from the German original: “Revolutionär wäre, solche diskursiven Medien zu dialogischer Funktion umzuwandeln. Meiner Meinung nach ist dies die heute noch einzig mögliche Form einer revolutionären Aktion in der Konsumgesellschaft.”


unthinkable, indiscriminate access to everyday political information. Today, for example, we see brands, parties, politicians, and governments using the internet to observe and influence the masses in ways that were previously unimaginable. However, even if this usage is more participatory and less controlled, this lack of control and coordination has side-effects, such as hate speech, cyber-bullying, or audience fragmentation. These side-effects are closely related to current claims about filter bubbles and echo chambers (see sec. 2.1).

As early as 2006, Habermas revisited his influential model of a transforming public sphere, now rooted in networks. Even if he does not explicitly take the emergence of the Web and online social networks into account, he includes “actors of civil society” (Habermas, 2006, p. 415) as nodes of these networks. Locating these networks at the periphery of the political system, he claims that a deliberative model of society needs a self-regulating media system that maintains its independence. At the same time, “an inclusive civil society must empower citizens to participate in and respond to a public discourse” (Habermas, 2006, p. 419). Nevertheless, from his perspective, journalists and politicians play the most important roles in his model of the public sphere. Therefore, the assumption of some scholars that political journalism as we know it will end would mean, for him, the loss of the “centerpiece of deliberative politics” (Habermas, 2006, p. 423).

Corresponding to the expected dispersion of (the) public sphere(s), McQuail (2010) distinguishes four phases of audience fragmentation (see fig.
2.4):

• A unitary model, where the whole audience relied on a small number of one-way channels that did not differ significantly in the content they sent;

• A pluralism model, where the number of channels grew and, with it, the differentiation of content; however, people still lived in the same informational context;

• Following this, a core-periphery model, in which the multiplication of channels undermines the unity of the framework (this is the stage that he sees the most developed countries to be in); and

• Finally, as a future possibility (from McQuail’s 2010 perspective), the breakup model, “where fragmentation accelerates and there is no longer ‘any centre’, just very many and very diverse sets of media users” (McQuail, 2010, p. 444 f.).


Figure 2.4: Four stages of audience fragmentation (source: McQuail (2010, p. 445); original version: Figure 8.1 (p. 138) in McQuail, Denis, Audience Analysis. Copyright © 1997 by SAGE Publications, Inc. Reprinted by permission of SAGE Publications, Inc.)

Assuming this breakup scenario, Habermas and McQuail could have underestimated or overlooked the influence of rather unorganised, but efficiently connected crowds. At its periphery, the “unruly life of the public sphere” (Habermas, 2006, p. 417) moved more into the centre of public discourse: more powerful, more measurable, but also, perhaps, easier to mislead and influence. So argues Bruns (2008, p. 66): while it is true that the independence and intermediacy of the public sphere is threatened, “it is possible to point to a variety of new spaces which augment and supplement the mass-mediated public sphere”. Classical journalism becomes an echo chamber for a closed discourse between elite journalists and politicians, while “mass media organisations now become little more than clusters or nodes in the wider network” (Bruns, 2008, p. 69). However, from his perspective, their services as social glue are no longer needed because:

What we see emerging, then, is not simply a fragmented society composed of isolated individuals, but instead a patchwork of overlapping public spheres centred around specific themes and communities which through their overlap nonetheless form a network of issue publics that is able to act as an effective substitute for the conventional, universal public sphere of the mass media age. (Bruns, 2008, p. 69)

Figure 2.5: The long tail distribution of audience size and engagement according to Bruns (2008, p. 70)

Bruns describes a slope with villages of issue publics in highly engaged niches around a mainstream mountain with rather low active participation (see figs. 2.5, 2.6), connected by highways of information exchange: despite having some autonomy, these publics depend on each other and stay connected. This matches McQuail's third model of a core-periphery structure. McQuail's vision in his breakup model, however, goes further: it rather resembles islands of separated publics that are moving away from each other, driven by some continental drift. Both of McQuail's predictions about the future of the public sphere, already raised by him in 1997 and taken up by Bruns in 2008, are still justified in 2017: audiences could either become more fragmented, or more integrated as the result of increased interactivity and interconnectedness (McQuail, 2010, p. 408).

Figure 2.6: An artistic depiction of the model of the public sphere by Bruns (2008, p. 71)

Indeed, after a number of surprising political upheavals in 2016, to many the fragmentation seemed to have already happened:

The recent emergence and success of political movements that appear to be immune to any factual evidence that contradicts their claims – from climate change denialists through Brexiteers to the 'alt-right', neo-fascist groups supporting Donald Trump – has reinvigorated claims that social media spaces constitute so-called 'filter bubbles' or 'echo chambers'. (Bruns, 2016b, para. 1)

However, this claim seems to some "not particularly well supported by the available facts" (Bruns, 2016b, para. 3). Bruns (2016b) cites Duggan, Smith, & Page (2016), who present representative survey data from the USA, to argue that, for example, "50% of social media users have been surprised by one of their social media connections' political views". For Bruns, this finding contradicts the notion of an echo chamber. Rather, Bruns & Highfield (2016) argue that we have to rethink the public sphere concept. Instead of "a universal, nationwide public sphere" (Bruns & Highfield, 2016, p. 4), or a space full of hermetically sealed bubbles (Bruns & Highfield, 2016, p. 9), they draw a hierarchical framework for a new networked public sphere model, with a number of rather static, slowly emerging and fading public sphericules (e.g., regarding climate change) that overlap domain-based publics (e.g., politics, science, and industry), and are connected by rather short-lived issue publics (e.g., around reports, conferences, protests, or politicians' statements) (Bruns & Highfield, 2016, p. 17).

However, there is currently no clear evidence for either side of the debate, and proponents of the echo chamber and filter bubble hypotheses also have convincing arguments (see sec. 2.1.2). Therefore, more empirical research is needed and, pragmatically speaking, both sides can, do, and will profit from the use of network science methods: either to define and give quantified, empirical evidence for or against the constructs of filter bubbles and echo chambers; or to reject, or verify and extend, a model of a networked public sphere. Why, and how, a deeper engagement with network science methods is most promising for advancing theory – irrespective of which hypothesis turns out to reflect reality – is demonstrated in the second study of this thesis in chapter 6.

2.4.2 Theories about Network Structures in the Public Sphere

Descending from the macro-level public sphere to the mechanics within public sphericules, issue publics, and private spaces, many theories of media and communication scholars can be read as theories about network structures and dynamics in the public sphere. In this section, I outline some of these theories and concepts, and detail their translation into network structures.

2.4.2.1 Diffusion of News and Information

News events are for communication scholars what drosophila melalagasters [sic] (fruit flies) are for geneticists: A new generation comes along very quickly. The diffusion of a news event is a discrete mass communication function, and its study sheds light on the complex process through which the mass media convey news stories to audience individuals, who then interact with each other as they give meaning to the news. Thus, news event diffusion exemplifies an inter-media process in which the media stimulate interpersonal communication among audience individuals, which in turn can stimulate behavior change. (Rogers, 2000, pp. 562–563)

News or information diffusion research seeks an answer to one of the basic questions of communication studies: "Who – Says What – In Which Channel – To Whom – With What Effect?" (Lasswell, 1948). However, the "unpredictable nature of the occurrence of a news event, combined with its rapid diffusion, makes this process difficult to study, at least in the rather deliberate way that most communication research [was] conducted" (Rogers, 2000, p. 563). Measuring the diffusion of news requires planning "well in advance of the news events" (Rogers, 2000, p. 563). Therefore, in 1957, Deutschmann and Danielson introduced a new paradigm for research about news event diffusion. It consists of three main elements (Rogers, 2000, p. 565):

1. A preplanned methodology to gather data while the recipients still recall how they learned about the event.
2. Data-gathering about multiple events, to enable comparisons.
3. "Focus on the rate of diffusion, mass media and interpersonal channels of diffusion, the two-step flow via interpersonal channels stimulated by the media, and the perceived salience of a news event" (Rogers, 2000, p. 565).

Considering these three points in the light of the availability of social media data raises certain issues. With regard to the first point, it is no longer necessary for recipients to recall when they learned about an event: if access to their digital traces is given (a big 'if', but nevertheless), one can infer with high confidence when and how this happened. The second point – data-gathering about multiple events to make comparisons – has become much easier. Finally, the third point – "the rate of diffusion, mass media and interpersonal channels of diffusion, the two-step flow via interpersonal channels stimulated by the media, and the perceived salience of a news event" – has become more straightforward to quantify, despite the fact that requirements for the technological skills of the researcher have increased.
Furthermore, most digital trace data addresses only one form of news engagement: via social and/or online media.



Nevertheless, this paradigm "set the pattern for the many news diffusion studies that were to follow over future decades" (Rogers, 2000, p. 565). Traditionally, the diffusion of news is still understood as "its takeup and incorporation into what people 'know'", and is still focused on four main variables: the extent to which people (in a given population) know about a given event; the relative importance or perceived salience of the event; the volume of information about it that is transmitted; and the extent to which knowledge of an event comes first from news media or from personal contact (McQuail, 2010, p. 510).

The growing number of channels and the decline of centralised mass media make these variables more difficult to determine. Nevertheless, the fact "that word of mouth plays a key part in the dissemination of certain kinds of dramatic news is continually reconfirmed" (McQuail, 2010, p. 511), and social media is considered able to amplify word-of-mouth effects (McQuail, 2010, p. 470). Because users now have comparable possibilities to produce (and not only to share) content, they have become 'produsers' (Bruns, 2008). Therefore, the distinction between mass media and personal contacts has become blurry. From a network science perspective, as fig. 2.7 illustrates, these kinds of problems can be translated into contagion processes on (multilayered) networks, with news media sources and audience members as nodes. As sec. 2.5.4 shows, network science methods have already been applied to address some of the problems mentioned above.
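To make the translation into a contagion process concrete, the following sketch simulates a simple independent-cascade diffusion on a toy directed network with one media source. It is purely illustrative: the networkx library, the node names (e.g., `news_outlet`), and all parameter values are choices made for this example, not methods or data used in this thesis.

```python
import random
import networkx as nx

def independent_cascade(graph, seeds, p=0.3, rng=None):
    """Simulate one independent-cascade run: each newly informed node
    gets a single chance to pass the item to each follower with probability p."""
    rng = rng or random.Random(42)
    informed = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for node in frontier:
            for neighbour in graph.successors(node):
                if neighbour not in informed and rng.random() < p:
                    informed.add(neighbour)
                    nxt.append(neighbour)
        frontier = nxt
    return informed

# A toy network: one media node broadcasting into a random audience graph.
g = nx.gnp_random_graph(50, 0.08, seed=1, directed=True)
g.add_node("news_outlet")
for audience_member in range(0, 50, 5):  # the outlet reaches every 5th user directly
    g.add_edge("news_outlet", audience_member)

reached = independent_cascade(g, seeds=["news_outlet"], p=0.3)
print(f"{len(reached)} of {g.number_of_nodes()} nodes were reached")
```

Varying `p`, or the structure of the underlying graph, changes how far an item spreads: exactly the kind of relationship that the network science work reviewed in sec. 2.5.4 investigates systematically.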

Figure 2.7: Schematic of a news diffusion network

2.4.2.2 The Perceived Audience: From Undifferentiated Masses to Ad Hoc Issue Publics and Communities

To understand and build theories about a networked public sphere necessitates a closer look at the predecessors of the sub-publics that Bruns & Highfield (2016) propose. The most prominent and relevant concepts in mass communication theory are the perceptions of a group of actors in a media environment as an audience, a public, or a community. This section discusses the several notions of these terms that will be especially important for the case study that investigates the structure of the Australian Twittersphere in chapter 6. However, these notions are also relevant to information diffusion, which is the focus of chapter 5.

The notion of an 'audience', through which news and information diffuse, is complex and has been subject to change over recent decades.

'Mass' media audiences have been the focus of public and cultural policy since there were masses to mediate. They were for decades regarded as relatively undifferentiated, unknowable, by turns desirable (redeemable) and threatening (revolutionary). The technologies of communication characteristic of the twentieth century have been designed to reach them and regulate

them, influence them and stop them being influenced. Oddly enough, these great unknowable masses that have stalked the pages of social and media theory, government legislation and cultural criticism since the nineteenth century have themselves been the locus of the development of the form of citizenship based not on sameness (undifferentiated mass), but on difference. (Hartley, 1999, p. 164)

Figure 2.8: Schematic of perceptions of the audience in a broadcast mass media environment before the emergence of networked mass media

The perception of the mass media audience as an amorphous mass cannot be a correct one. If one is mostly interested in audience size, it might seem sufficient to model star-shaped networks of recipients around some broadcasting sender (see fig. 2.8). However, from both a normative perspective, and the need to understand and explain more complex audience behaviours than channel switching, we need to go further. The Internet made it obvious that the "audience member is no longer really part of a mass, but is either a member of a self-chosen network or special public or an individual" (McQuail, 2010, p. 140). With this insight, the idea of the 'virtual community' "that can be formed by any number of individuals by way of the Internet



at their own choice" (McQuail, 2010, p. 150), emerged. The introduction of the term 'community' came with many definitional issues. "Once upon a time, we thought we knew what communities were: small knots of people in local areas ('neighborhoods') where people knew each other and were mutually supportive" (Gruzd, Jacobson, Wellman, & Mai, 2016, p. 1187). Then things became complicated. Inter alia, the term was expanded "to aggregates of people with similar attributes or characteristics (such as 'the gay community')", or to "imagined communities" (Gruzd et al., 2016, p. 1187) whose members have never met. Meanwhile, "[c]omputer scientists have focused on connectivity without mindfulness" (Gruzd et al., 2016, p. 1188) by using community detection algorithms that were based only on the density of observed connections.

If we follow a community definition that requires a closer connection between its members, describing the networked public sphericules and issue publics explained above as 'communities' would often exaggerate the quality of their connectedness. Bruns & Burgess (2011) argue that referring to hashtag 'communities' on Twitter would imply that the members of those networks "share specific interests, are aware of, and are deliberately engaging with one another, which may not always be the case". Therefore, choosing 'issue public' to describe the loosely connected members of an active audience, and to distinguish it from a 'community', seems more appropriate. Following this definition, the differentiation between community and issue public would rely on properties of the members' network (e.g., on whether the ties between members are social, topical, reciprocal, uni-directional or, indeed, whether they even exist). One could use a multilayer network, consisting of a social and an information layer (see fig. 2.9), to operationalise this notion of different kinds of audience groups.
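A minimal sketch of this operationalisation follows, assuming a hypothetical social layer of follow ties and an information layer of issue-related interactions; reciprocity here stands in, very crudely, for the closer connection that a 'community' requires. The actor names and edges are invented for illustration, and networkx is used only for convenience.

```python
import networkx as nx

# Two layers over the same actors: 'social' ties (e.g., follows) and
# 'information' ties (e.g., retweets or replies about one issue).
social = nx.DiGraph([
    ("a", "b"), ("b", "a"),
    ("a", "c"), ("c", "a"),
    ("b", "c"), ("c", "b"),
    ("d", "a"),                 # d follows a, but not vice versa
])
information = nx.DiGraph([("d", "a"), ("e", "a"), ("f", "a")])

def reciprocal_pairs(g):
    """Pairs connected in both directions: a minimal proxy for community-like ties."""
    return {frozenset((u, v)) for u, v in g.edges if g.has_edge(v, u)}

print("reciprocal social ties:", reciprocal_pairs(social))
print("reciprocal information ties:", reciprocal_pairs(information))
```

In this toy example, a, b, and c form a fully reciprocal cluster on the social layer (community-like), while d, e, and f merely point at a on the information layer (issue-public-like): the same distinction the multilayer model in fig. 2.9 is meant to capture.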
The possibility of every member of any audience being a member of an arbitrary number of communities and publics online leads to a context collapse.15 Using the example of a video to be recorded for YouTube, Wesch (2008) describes this context collapse in a blog post:

15 It is unclear who actually coined the term. danah boyd refers to it in published writing as collapsed context before 2009 (boyd, 2013) and, in 2009, the term context collapse came up in Wesch (2009) and Marwick & boyd (2010) (who had submitted their article in 2009). boyd ascribes this to a possibly independent, parallel development of the same idea, building on similar sources: "While we ran in the same circles, I'm not sure that either one of us was directly building off of the other but we were clearly building off of common roots" (boyd, 2013).


Figure 2.9: Schematic illustrating the differentiation between community and issue public

The problem is not lack of context. It is context collapse: an infinite number of contexts collapsing upon one another into that single moment of recording. The images, actions, and words captured by the lens at any moment can be transported to anywhere on the planet and preserved (the performer must assume) for all time. The little glass lens becomes the gateway to a blackhole sucking all of time and space – virtually all possible contexts – in upon itself. The would-be vlogger, now frozen in front of this black hole of contexts, faces a crisis of self-presentation. (Wesch, 2008, paras. 4–5)



The crisis of the vlogger16 goes hand in hand with the crisis of the network

scientist who is trying to identify the audiences, communities, or issue publics to which such a user belongs: "In OSNs [Online Social Networks] the tendency to belong to several groups at the same time becomes visible and reaches unprecedented levels", and this leads to "the presence of a large core with no apparent community structure – also known as fur ball networks" (Dickison, Magnani, & Rossi, 2016, p. 98). Nevertheless, algorithms for hierarchical or overlapping community structures that rely on more than "connectivity without mindfulness" (Gruzd et al., 2016, p. 1188) have been the subject of much research in the last few years (see sec. 2.5.3).
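To illustrate what 'connectivity without mindfulness' means in practice, the sketch below applies a standard modularity-based algorithm (greedy modularity maximisation, as implemented in networkx) to a toy graph of two dense groups joined by a single edge. The algorithm recovers the two groups from edge density alone; whether they are 'communities' in any social sense is exactly the question it cannot answer. The graph and parameters are invented for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two 5-cliques joined by a single bridge edge: the textbook case in which
# purely structural community detection succeeds.
g = nx.union(nx.complete_graph(range(0, 5)), nx.complete_graph(range(5, 10)))
g.add_edge(4, 5)

# Greedy modularity maximisation partitions the graph by edge density alone.
communities = greedy_modularity_communities(g)
print([sorted(c) for c in communities])
```

On real follow networks, with their large undifferentiated cores, the output of such algorithms is far less clear-cut, which is one motivation for the second study in chapter 6.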

2.4.2.3 Gatekeeping

If we turn our attention to the behaviour of the nodes in this complex, multi-layer network, and to their impact on the diffusion of news and information, it is not possible to ignore the concept of gatekeeping. This concept is especially important for connecting the findings of information diffusion research and relevant methods from network science (as is later done in chapter 5) with traditional mass communication theory, and for interpreting findings in a way that is meaningful to media and communication studies.

The concept of gatekeeping has its origins with Kurt Lewin (1947), who "conceptualized the social world as a relationship between individuals and groups" (DeIuliis, 2015, p. 5). To conceptualise how individuals affect what passes through the channels to the connected groups, he proposed the concept of 'gatekeeping', which applies not only to information, but to everything that passes through a society. The 'gatekeepers' decide what slips through. Thus, according to Lewin, social change occurs by influencing the gatekeepers: "To understand the forces, one must first identify the gatekeepers, then change [sic], or change the mind of the gatekeepers" (DeIuliis, 2015, p. 7).

Only three years later (in 1950), David Manning White applied the gatekeeping concept to the selection of news. He tried to understand how a newspaper editor, under the pseudonym 'Mr. Gates', made decisions on which events would be covered in his papers (DeIuliis, 2015, p. 8). This led to a model of the spread of news, depicted schematically in fig. 2.10. Subsequent research in the 90s found support for this focus on the subjective decisions of an individual. Later, however, this was relativised by research

16 video blogger



Figure 2.10: Schematic of a gatekeeping network

that "found that routine forces had a greater impact on the gatekeeping decisions of both online and print journalists than individual factors" (DeIuliis, 2015, p. 9). This might also be due, however, to a change in professional routines over time.

With respect to changes in media practices in reaction to the interactive possibilities of online journalism, Bruns introduced the concept of 'gatewatching' (Bruns, 2011a). For him, "gatekeeping practices were simply a practical necessity" (Bruns, 2011a, p. 118) that was "born out of an environment of scarcity (of news channels, and of newshole space within those channels), [and] any growth in the overall newshole must necessarily challenge its role" (Bruns, 2011a, p. 120). As digital possibilities offer abundance rather than scarcity, he describes the inclusion of the audience in the collaborative curation of information as 'gatewatching' (see fig. 2.11).

In 2009, Shoemaker and Vos "synthesized the extant models of gatekeeping into a model of the gatekeeping field. They argue that the constructs of gates, gatekeepers, forces, and channels are as relevant now as they were for Lewin" (DeIuliis, 2015, p. 10). Therefore, they included the fields of the sources, the media, and the audience. Every one of these fields has its own gatekeeping mechanisms (DeIuliis, 2015, p. 10). However, in 2008, Barzilai-Nahon already saw the lack of a theoretical foundation for gatekeeping



Figure 2.11: Schematic of the gatewatching process after Bruns (2011)

in new media technologies (DeIuliis, 2015, p. 11), and proposed the concept of 'network gatekeeping' as a solution. Network gatekeepers not only control the information that flows to their connected groups; they also try to lock those groups in a network. In this network, they try to protect their norms and information, while ensuring an unhindered information flow within the boundaries of this network (DeIuliis, 2015, p. 13). Or, in other words:

Network gatekeeping theory extends traditional gatekeeping theory beyond selection of news, content, emotions, or information to addition, withholding, display, channelling, shaping, manipulation, timing, localisation, integration, disregard, and deletion of information. (DeIuliis, 2015, p. 20)

If one sticks with a simpler concept of gatekeeping, publishers keep the role of final gatekeeper, even if it has become only one of their many roles. It "will be found in some types of Internet publications, but not in others" (McQuail, 2010, p. 140). If this is the case, there are still promising possibilities to work with the definition of



‘gatekeeping’ as ‘forwarding or withholding information to, or from, a possible audience’. Hindman (2009) agrees that “[s]ome ways in which online information is filtered are familiar, as traditional news organizations and broadcast companies are prominent on the Web” (p. 13). Additionally, however, “[s]earch engines and portal Web sites are an important force, yet a key part of their role is to aggregate thousands of individual gatekeeping decisions made by others” (Hindman, 2009, p. 13). Therefore, “the Internet is not eliminating exclusivity in political life; instead, it is shifting the bar of exclusivity from the production to the filtering of political information” (Hindman, 2009, p. 13).

Figure 2.12: Schematic of a network gatekeeper in the form of a bridge between clusters, as found by Jürgens, Jungherr, & Schoen (2011)

This conceptual shift does not necessarily mean, however, that there are no longer any gatekeepers. Jürgens, Jungherr, & Schoen (2011) analysed communication networks17 among politically vocal Twitter accounts preceding the German parliamentary elections of 2009. They found that these were small world networks – so that every node could be reached from every other node within a few hops – but that information could not spread unhindered to every node. They identified accounts at positions where, if those accounts did not exist, the network as a whole would be disconnected. They could show that these 'new gatekeepers' were "able to block, or at least severely hinder, selected political information from reaching the whole conversation network and thus from achieving public prominence" (Jürgens et al., 2011, sec. 1); that is, they functioned as information brokers by serving as bridging nodes in a network, as illustrated in fig. 2.12. At the same time, they showed, by means of the hashtag use of the accounts,

17 @mention and retweet networks



that all of those 'new gatekeepers' had a bias towards or against one or more of the political parties involved in the election.

Add to that the heavy influence of a user's structural position in the conversation network on the visibility of her opinion and we have a situation in which personal opinions of Twitter users are heavily amplified by the technological bias of the structure of conversation networks on Twitter (Jürgens et al., 2011, sec. 5).

In summary, the over-half-a-century-old idea of gatekeeping has not lost its importance; rather, it again becomes a crucial concept for understanding public communication processes in a networked public sphere. However, to aid this understanding, we need to use network science methods: first, as seen in the example of Jürgens et al. (2011), to identify the gatekeepers in media networks; and, second, to measure their influence on the surrounding network, and on the information flowing through it.
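The kind of bridging accounts Jürgens et al. (2011) identified can be illustrated with two standard network measures: articulation points (nodes whose removal disconnects the graph) and betweenness centrality. The sketch below uses a barbell graph, i.e., two cliques joined by a short path, as a stand-in for two conversation clusters; it illustrates the measures themselves, not the original method or data.

```python
import networkx as nx

# Two conversation clusters (5-cliques) joined by a 2-node path: a stand-in
# for clusters connected only through a few broker accounts.
g = nx.barbell_graph(5, 2)

# 'New gatekeepers': nodes whose removal would disconnect the network.
brokers = sorted(nx.articulation_points(g))
print("cut nodes:", brokers)

# Betweenness centrality ranks how much shortest-path traffic crosses each node;
# the path nodes between the cliques score highest.
centrality = nx.betweenness_centrality(g)
most_central = max(centrality, key=centrality.get)
print("most central broker:", most_central)
```

On real @mention or retweet networks, the same measures flag the accounts that can "block, or at least severely hinder" information flow between otherwise separate clusters.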

2.4.2.4 One-, Two-, and Multi-Step Flows

Gatekeeping alone only affects access to information, but cannot by itself explain influence and effects on the recipient. For this explanation, we also need to trace the dissemination of information after it passes the gatekeeper. Closely connected to the concept of gatekeeping is one of the oldest statements about structure in the public sphere: the two-step flow of communication hypothesis (fig. 2.13, right path). This hypothesis "states that personal influence exercised by other people normally plays a more critical role in everyday decision making than information obtained from mass media" (Davis, 2009, para. 1). This contrasts with the one-step flow hypothesis (fig. 2.13, left path), which assumes a strong direct effect of mass media (Davis, 2009, para. 1). Both concepts have to be reconsidered in light of the reshuffled structures and hierarchies of a networked public sphere.

The idea of a two-step flow has its origins in Lazarsfeld et al. (1944), who identified 'opinion leaders' to whom information from news media flows first. In the second step, these opinion leaders (who are also gatekeepers in some way) pass on information to the less active population (McQuail, 2010, p. 473). Further research could confirm



Figure 2.13: Schematic of the one-step and two-step flow hypothesis

the importance of personal contact and conversation, but "has not yet clearly shown that personal influence always acts as a strong independent or counteractive source of influence on the matters normally affected by mass media" (McQuail, 2010, p. 473). Nevertheless, findings indicate that opinion leaders act as gatekeepers in a more horizontal way than one might assume. They mostly influence their social peers or people directly below them in a social hierarchy (Davis, 2009, para. 2), moderating and explaining newsworthy content to the public. However, Habermas (1962, pp. 315–316, 355) came to the conclusion that, even if they are more active and reflexive than the general audience, their opinion is not 'public' in the way that mainstream media content is public. Furthermore, as opinion leaders are entangled in a web of social



expectations, he suggests that their opinion tends to self-reinforce in habitual rigidity (Habermas, 1962, pp. 315–316).

The notion of a two-step flow of influence was further challenged by inconsistent findings about the pathways that information takes through society. This challenge inspired a number of multi-step models. However, because these models could so easily be adjusted to every possible mechanism of influence, they were hard to disprove (Davis, 2009, para. 7). Complicating the matter further, the web led to the emergence of a new form of opinion leaders on blogs and social media, who could make their opinions public. This development no longer matches Habermas' analysis of the public sphere in 1962. Even though Highfield (2011, p. 92) agrees that such leaders are unlikely to change each other's opinions, he sees them as possible opinion leaders for their audiences. To find the most influential bloggers in the Australian and French blogospheres, he examined the citation and link networks among them. The same has been done with Twitter retweet and @mention networks, as seen below.

During the emergence of mass media, the two-step flow theory marked one of the milestones in understanding a very basic network structure of influence and news diffusion. Network science could not only help in the discovery of opinion leaders; it could also deliver an explanation for the contradictory results on one-, two-, and many-step flows, and for how and why there is most likely no single mechanism of influence. It could achieve this by enabling a more detailed picture of the actual information flows to be drawn. Thus, one-, two-, or many-step models stand as exemplars for all other concepts presented in this section: news diffusion; audiences, publics, and communities; and gatekeeping.
While being cornerstones of classical media and communication theory, these concepts are anything but outdated: They are still fundamental building material for a new, updated model of a networked public sphere. Furthermore, as these concepts are translatable into network concepts, network science methods can provide empirical evidence for such a new public sphere model.
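The difference between one- and two-step flows translates directly into path lengths from a media source in a network. The following toy sketch, with hypothetical node names and networkx used for illustration only, builds a two-step structure in which the wider audience is reachable only via opinion leaders:

```python
import networkx as nx

# A toy two-step flow: the media node reaches opinion leaders directly,
# and the wider audience only through those leaders.
g = nx.DiGraph()
for leader in ("leader_1", "leader_2"):
    g.add_edge("mass_media", leader)
    for i in range(5):
        g.add_edge(leader, f"{leader}_follower_{i}")

# The 'step' at which each node is reached = shortest path length from the source.
steps = nx.single_source_shortest_path_length(g, "mass_media")
second_step_audience = [n for n, s in steps.items() if s == 2]
print(len(second_step_audience), "nodes reached only in the second step")
```

In empirical retweet cascades, the distribution of such path lengths is rarely this clean, which is precisely why one-, two-, and many-step models each fit some observed flows and not others.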



2.5 The Network Science Perspective

The schematic drawings in sec. 2.4 do not resemble networks by chance. Rather, they illustrate that many mass communication issues can be translated into network science problems. There is a plethora of existing network science methods that could be used to address these kinds of problems, and impressive reviews and textbooks that comprehensively catalogue them are available. While I cite such reviews when appropriate, this study does not seek to present its own comprehensive catalogue. Rather, it explores the possibility of applying network science methods to media studies in general; in particular, it introduces concepts and methods that are of special interest for investigating the diffusion of information in, and the structures of, a networked public sphere.

Network science itself seems to many to be a fairly new field; in fact, however, its roots reach back centuries, as sec. 2.5.1 shows. Understanding the disciplinary divide within the field itself, which is described in this section, is important for understanding some of the challenges to merging network science and mass communication theory. Following the discussion of this divide, I give a brief overview of some models of networks and their growth in sec. 2.5.2; this is necessary background for understanding community detection algorithms, which are employed in the second study in chapter 6. Sec. 2.5.3 gives an overview of different approaches to community detection, and shows how a disconnect between algorithm development and an understanding of different notions of community can lead to questionable results. This disconnect is also one of the main motivations for the second study. In sec. 2.5.4, I summarise a selection of work that models, analyses, and predicts the contagion (or diffusion) of information, news, behaviour, emotions, and so on, on a network. This summary is crucial for understanding the discussion, arising out of the first study, of the qualitative differences in the diffusion of two hashtags and a link in chapter 5.

2.5.1 A Brief History of Network Science

In his seminal 1948 paper Science and Complexity, Warren Weaver, the then director of the natural science division at the Rockefeller Foundation, "explained the three eras that according to him defined the history of science" (Hidalgo, 2016, p. 1):


• the era of simplicity, dealing with problems "in which one can rigidly maintain constant all but two variables" (Weaver, 1948, p. 536). These problems were, therefore, solvable with simple calculus (Hidalgo, 2016);

• the era of disorganized complexity, dealing with "problems that can be described using averages and distributions, and that do not depend on the identity of the elements involved in a system, or their precise patterns of interactions" (Hidalgo, 2016, p. 2). Examples are: the thermodynamics of a gas; the audience size of a TV show; or the determination of insurance premiums, "where the individual event is as shrouded in mystery as is the chain of complicated and unpredictable events associated with the accidental death of a healthy man" (Weaver, 1948, p. 538);

• and, finally, the era of organised complexity:

This was a new science focused on problems where the identity of the elements involved in a system, and their patterns of interactions, could no longer be ignored. This involved the study of biological, social, and economic systems. According to Weaver, to make progress in the era of organized complexity, a new math needed to emerge. (Hidalgo, 2016, p. 2)

From Weaver's 1948 perspective, this era 'of organized complexity' was about to

begin. It was necessary to map a region hitherto overlooked in the scientific landscape. Regarding the first two eras, Weaver (1948) was "tempted to oversimplify, and say that scientific methodology went from one extreme to the other – from two variables to an astronomical number – and left untouched a great middle region" (p. 539). The problems in this region are "problems which involve dealing simultaneously with a sizable number of factors which are interrelated into an organic whole", but which "cannot be handled with the statistical techniques so effective in describing average behavior in problems of disorganized complexity" (Weaver, 1948, pp. 539–540). Part of the "new math" was the science of networks, "a clear response to Weaver's request" (Hidalgo, 2016, p. 2). Graph theory had existed at least since Leonhard Euler's solution to the Seven Bridges of Königsberg problem in 1736 (Biggs, Lloyd, & Wilson, 1976).
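Euler's argument can be replayed directly in code: representing the four land masses and seven bridges of Königsberg as a multigraph, a walk crossing every bridge exactly once exists only if at most two nodes have odd degree. The sketch below uses networkx for illustration.

```python
import networkx as nx

# The Seven Bridges of Königsberg: four land masses (A, B, C, D)
# connected by seven bridges, modelled as a multigraph.
bridges = [
    ("A", "B"), ("A", "B"),  # two bridges between A and B
    ("A", "C"), ("A", "C"),  # two bridges between A and C
    ("A", "D"), ("B", "D"), ("C", "D"),
]
g = nx.MultiGraph(bridges)

# Euler's criterion: an open walk over every edge exists only if
# zero or two nodes have odd degree. Here all four are odd.
odd_degree_nodes = [node for node, degree in g.degree() if degree % 2 == 1]
print("odd-degree land masses:", odd_degree_nodes)
print("walk over every bridge exactly once:", nx.has_eulerian_path(g))
```

That a 1736 result runs unchanged on a modern graph library underlines how old the mathematical core of network science is, even if its large-scale application is recent.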

2.5. THE NETWORK SCIENCE PERSPECTIVE


However, its application had a renaissance. As “mathematical objects that help us keep track of the identity of the elements involved in a system and their patterns of interaction”, networks are “ideal structures to describe problems of organized complexity” (Hidalgo, 2016, p. 2). Yet “the science of organized complexity was born fragmented, with pioneers in many different fields” (Hidalgo, 2016, p. 2). It “emerged in parallel efforts that are not easy to reconcile” (Hidalgo, 2016, p. 3). Hidalgo (2016), “using a thick brush” (p. 3), paints a simplified yet instructive picture of two streams:

• one flowing around the disciplinary borders of social sciences, political sciences, and economics (subsumed under “social sciences” in the following);

• the other having its source in computer sciences, physics, mathematics, and biology (referred to as “natural sciences” for the remainder of this subsection).

However, these streams did not flow towards the same ocean. Hidalgo (2016) attests that they had inherently “diverging academic goals” (p. 1):

Scholars trained in the social sciences focus on explaining social and economic phenomena, and are interested on [sic] how networks affect the individuals and organizations forming these networks (demographics, income, etc.). (Hidalgo, 2016, p. 11)

This contrasts with the goals of the other stream of network science:

Natural scientists, on the other hand, are interested in identifying features that are common to a wide variety of networks, and hence focus on the use of stochastic and generative models that are agnostic about the properties of individuals, or their goals. This pushes natural scientists to focus on what different networks have in common, instead of what sets them apart. (Hidalgo, 2016, p. 11)

For example, for social scientists, links were often “social relationships that are meaningful only as long as the individuals involved in them trust and support each other in specific ways” (Hidalgo, 2016, p.
4). Natural scientists, in contrast, mostly preferred a definition “more abstract and driven by the availability of data”, and one


involving “recorded acts of communications that are independent of social context” (Hidalgo, 2016, p. 4). Apart from the notion of a link and its implications for the collection of data, the applications of network science also differed due to differences in academic objectives. Social sciences, for example, naturally used network science to explain phenomena on the micro- to meso-scale. How, for example, does homophily lead to ethnic segregation and, in combination with the importance of social ties in the labour market, to economic imbalances? Or, why do more trustful societies perform better by macroeconomic standards (Hidalgo, 2016, pp. 7–9)? Natural scientists, on the other hand, focused mainly on five things:

(i) explaining the topology of networks in terms of stochastic models,
(ii) developing algorithms to quantitatively describe the topology of networks, from their degree distribution to their community structure,
(iii) modeling the spread of diseases and information on networks,
(iv) using networks as a mean [sic] to model large interconnected systems, by mapping connections among diseases, language, or similar products, and
(v) to study the implications of network structure for game theoretical outcomes, not in the context of link formation, but primarily in the context of the evolution of cooperation. (Hidalgo, 2016, p. 10)

Social scientists might ask, however: “What questions can natural scientists answer with their context agnostic approaches” (Hidalgo, 2016, p. 10)? Hidalgo (2016) lists the analysis of the vulnerability of networks and link prediction methods, where the latter “is a good example of a disconnection between the literatures advanced by natural and social scientists” (p. 10). While these methods often depend on the closure of open triangles of relationships in a network (triadic closure), their developers often do not cite the social science literature on triadic closure.
Instead, they focus on comparing a repertoire of measures of open triads and machine learning algorithms in search for the combination of features and algorithms that maximize the accuracy of the predictions. (Hidalgo, 2016, p. 10)

Community detection is another such example: a field of great interest (especially for physicists) that is also popular among social scientists. As we will see in sec. 2.5.3, boundaries


between disciplines can range from definitional issues – that is, regarding the term ‘community’ – to questionable applications of the algorithms developed. In any case, “while the mass media may have ‘discovered’ social networks”, they are nothing new: “What is relatively new are systematic ways of talking about social networks” (Kadushin, 2012, p. 18). Sociometry, by means of the sociogram – the ancestor of social network analysis (SNA) – dates back to the early 1930s (Wasserman & Faust, 1994, p. 11). Many scholars trace the first mention of the term “social network” to Barnes in 1954 (Wasserman & Faust, 1994, p. 10). From then on, social network analysis evolved further “as an integral part of advances in social theory, empirical research, and formal mathematics and statistics” (Wasserman & Faust, 1994, p. 3) for more than 50 years. This research was mostly based in and relevant to the social, political, and behavioural sciences. Topics include, i.a., the world political and economic system, community elite decision making, social support, community, group problem solving, the diffusion and adoption of innovations, cognition or social perception, markets, exchange and power, consensus and social influence, the sociology of science, and coalition formation (Wasserman & Faust, 1994, p. 6). Social network analysis found such a wide range of applications because it promised not only a new tool for the social sciences, but also a way to pursue avenues of theory and scientific practice hitherto unexplored in those fields:

Social network analysis provides a precise way to define important social concepts, a theoretical alternative to the assumption of independent social actors, and a framework for testing theories about structured social relationships. The methods of network analysis provide explicit formal statements and measures of social structural properties that might otherwise be defined only in metaphorical terms.
Such phrases as webs of relationships, closely knit networks of relations, social role, social position, group, clique, popularity, isolation, prestige, prominence, and so on are given mathematical definitions by social network analysis. Explicit mathematical statements of structural properties, with agreed upon formal definitions, force researchers to provide clear definitions of social concepts, and facilitate development


of testable models. Furthermore, network analysis allows measurement of structures and systems which would be almost impossible to describe without relational concepts, and provides tests of hypotheses about these structural properties. (Wasserman & Faust, 1994, p. 17)

However, research on the topics mentioned above involved small groups of

maybe thousands of actors at the most, which is minuscule compared to what we can analyse today. Back then, researchers were “usually forced to look at finite collections of actors and ties between them. This necessitates drawing some boundaries or limits for inclusion. Most network applications are limited to a single (more or less bounded) group; however, we could study two or more groups” (Wasserman & Faust, 1994, p. 20). This was primarily due to the available means of collecting data, which were mainly limited to questionnaires, interviews, traditional observations, manual analysis of archival records, small-scale experiments, diaries, or chain letters (Wasserman & Faust, 1994, p. 45). But even if the massive datasets of today had been available, the computational means to analyse them would not have been sufficient. Today, in a sociological sense, we begin to reach the limits of any finite dataset: even though we still have to define the type of relations, given enough resources and data availability, it is now possible to analyse networks involving billions of actors – and therefore the largest possible group of people: the global human population. So, while network analysis has been well established in small-scale social science research for over half a century, what is relatively new is the accessibility of data about complex communication and interaction networks, and the possibility of analysing them on a large, national, or even global scale, due to the emergence of online media and online social networks. Therefore, network science should now be especially helpful for media and communication studies. However, Ackland, for example, laments that “the media-studies perspective on the web is often characterised by an absence of formal network techniques” (Ackland, 2013, p. 14). Furthermore, the literature summarised above in sec. 2.3 shows that, to this day, he is not alone in this assessment.
This does not mean that nobody talks about networks in media and


communication, as seen in sec. 2.4.2. It does indicate, however, that the techniques to apply the maths of organised complexity in media and communication do need some development, as does communication among the disciplines. This study fosters this interdisciplinary communication by giving equal weight to theory and knowledge about a networked public sphere from media and communication studies, and to a true understanding, explanation, development, and critical application of relevant network science approaches. This critical approach is crucial, because network science methods are often closely entangled with the theoretical constructs for whose empirical examination they have been developed.

Many of the key structural measures and notions of social network analysis grew out of keen insights of researchers seeking to describe empirical phenomena and are motivated by central concepts in social theory. In addition, methods have developed to test specific hypotheses about network structural properties arising in the course of substantive research and model testing. The result of this symbiotic relationship between theory and method is a strong grounding of network analytic techniques in both application and theory. (Wasserman & Faust, 1994, pp. 3–4)

This symbiotic relationship makes network science methods appealing, but it also bears dangers, as will become especially clear after reading chapter 6. Compared to the usual statistical apparatus known in media and communication studies, network science methods necessitate a stronger understanding of the underlying theory, concepts, and definitions that motivated their development. This also holds true for the subject of the next section: models of network growth, which embody theories and models, such as the small-world hypothesis, but are at the same time methods to create baseline measures or artificial networks for simulations, for example of the diffusion of information.


2.5.2 Drawing the Baseline: Understanding and Modelling the Growth of Networks and their Properties

This section provides a classification of network types, and gives examples of how to explain and model their growth (following the structure chosen by Pastor-Satorras et al., 2015, sec. III.D., “Network classes and basic network models”). These generative models are not only crucial because real-world data of networks are not always accessible; they are also helpful in isolating effects and controlling topological features for simulations (e.g., to understand the effects of actor-driven decisions at the micro-scale on the macro-scale properties of real-world networks, or to get a baseline for comparisons). Therefore, a thorough understanding and overview of the most important generative network models is necessary to correctly interpret the results of many network analysis methods. This understanding is especially relevant for some of the most important community detection algorithms, two of which are applied in the second study in chapter 6; that study depends on an understanding of these models – an understanding that can alter the interpretation of results in a fundamental way. Furthermore, the results of many studies that rely on simulations cannot be understood correctly if one does not have an overview of these models and their implications.

One of the most basic models leads to random homogeneous networks, described in a theoretical model, inter alia, by Erdős & Rényi (1960). This model can be easily constructed by starting with a set of nodes, and drawing a link between any two nodes with a fixed probability (Pastor-Satorras et al., 2015, p. 934). However, with the exception of a small maximum shortest path length (diameter) for an average number of links (degree) > 1, it does not resemble most real-world networks. For example, its proportion of closed triplets to open triplets (clustering coefficient) is too low (Pastor-Satorras et al., 2015, p. 934).
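As a minimal, hypothetical sketch (illustrative only, not code from the studies in this thesis), the Erdős–Rényi construction just described – draw each possible link between a fixed set of nodes with a fixed probability – might look as follows:

```python
import random

def erdos_renyi(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph.

    Each of the n * (n - 1) / 2 possible links between the n nodes
    is drawn independently with the fixed probability p.
    """
    rng = random.Random(seed)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

With p = 1.0 this yields the complete graph; choosing p close to the desired average degree divided by (n − 1) reproduces a given average degree, but – as noted above – not the clustering of most real-world networks.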
Nevertheless, it is used as a model for much research in diffusion dynamics on networks, some of which is summarised in sec. 2.5.4. A large class of real-world networks, however, are so-called small-world networks (Watts & Strogatz, 1998). They exhibit a high clustering coefficient and a small diameter, because they are made up of densely connected clusters connected by shortcuts, also known as long ties. One way to model these networks is, for example, to

1. Take a ring lattice of nodes which are each connected with a fixed number of


nearest neighbours, and to

2. Then reconnect a fraction of the neighbours of every node to a random node somewhere in the network.

This leads to smaller shortest path lengths, while the high clustering coefficient is mostly preserved. However, this model still “generates homogeneous networks where the average of each metric (e.g., degree or other centrality measures) is a typical value shared, with little variations, by all nodes of the network” (Pastor-Satorras et al., 2015, p. 934). However, many organically grown networks do not exhibit the degree distribution that would emerge if their linking process was random.

Empirical evidence from different research areas has shown that many real-world networks exhibit levels of heterogeneity not anticipated until a few years ago. The statistical distributions characterizing heterogeneous networks are generally skewed and varying over several orders of magnitude. Thus, real-world networks are structured in a hierarchy of nodes with a few nodes having very large connectivity (the hubs), while the vast majority of nodes have much smaller degrees. (Pastor-Satorras et al., 2015, p. 935)

Often, the degree distribution of these heavy-tailed networks follows a power-law (the same often applies to the distribution of many other node properties):

p(k) ∝ k^(−γ),    (2.1)

where

• p(k) is the probability of a node having k links,
• γ is a parameter > 0 changing the slope of the curve, while
• usually 2 ≤ γ ≤ 3.

This means that the probability that a node has k links is inversely proportional to the γ-th power of the number of links k (see fig. 2.14). This holds true, for example, for intracell protein networks; the hyperlink network on the web; the Twitter follower network; or the web of human sexual contacts (Hindman, 2009, p. 41). Often,


these are referred to as ‘scale-free networks’ because, no matter at which order of magnitude we observe the degree distribution, its shape stays the same. But why do all these networks share this characteristic distribution? And how can we model them?

Figure 2.14: The degree distribution in a power-law network compared to a random network degree distribution. Adapted from Wikimedia Commons, public domain, https://commons.wikimedia.org/wiki/File:Complex_network_degree_distribution_of_random_and_scalefree.png)
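Equation 2.1 can be illustrated numerically. The following hypothetical helper (names and parameters are illustrative, not from this thesis) draws node degrees from a discrete power law; with γ = 2.5, degree 1 is drawn far more often than degree 2, degree 2 more often than degree 4, and so on – the heavy tail in fig. 2.14:

```python
import random

def sample_power_law_degrees(n, gamma, k_min=1, k_max=1000, seed=None):
    """Draw n node degrees k from a discrete power law p(k) ∝ k^(-gamma)."""
    rng = random.Random(seed)
    ks = list(range(k_min, k_max + 1))
    weights = [k ** -gamma for k in ks]  # unnormalised p(k); choices() normalises
    return rng.choices(ks, weights=weights, k=n)
```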

One way is to use so-called ‘configuration models’, for which a degree distribution is ‘configured’ before generating the network. A simple generative model would be (Pastor-Satorras et al., 2015) to:

1. Assign a random number of link-stubs (i.e., a degree) to every node, based on the desired probability distribution, in a way that the sum of all degrees gives an even number (so that there are no loose ends); and then

2. Randomly connect the nodes via their stubs.

With this model, one can configure a power-law distribution. Such a configuration model is often used as the null model for calculating the modularity of a network. The maximisation of modularity is one of the most popular approaches to community detection in media studies today (see sec. 2.5.3 and sec. 6.4.1.1). A more complicated example of a configuration model is the stochastic block model (SBM) (Holland, Laskey, & Leinhardt, 1983), used especially for the study of community detection algorithms (see sec. 2.5.3), one of which is compared to modularity maximisation in the second study in chapter 6. This model works as follows (see, e.g., Peixoto, 2017a):


1. Nodes are assigned to groups.

2. Between every two groups, a number of links is determined.

3. Optionally (i.e., in the ‘degree-corrected model’), the degree sequence (i.e., the number of links for every single node) is also given.

4. According to these parameters, the links are distributed randomly between the nodes.

While we could force a high clustering coefficient in SBMs, the simpler configuration model can only be used to reproduce the skewness of the degree distribution: it does not lead to the clustering present in social networks. However, when “the network size grows as large as in Facebook, Twitter, or a mobile phone network, the average clustering coefficient in a simple random graph approaches zero and the geodesic distance [i.e. the shortest path] between any two vertices approaches infinity” (Wang, Lizardo, & Hachen, 2014, p. 2). Wang et al. (2014) presented and tested ways of generating large scale-free clustered random graphs, including their own model based on combining network ‘motifs’ (i.e., micro-patterns containing the triangles between nodes that are so typical of social networks). Mechanisms like this are needed to keep the clustering coefficient in the desired range. The problem is that these mechanisms either require a lot of resources to compute the graph, or lead to micro-structures that are artificial rather than random; in the worst case, they exhibit both problems. Even the algorithm proposed by Wang et al. (2014) only solves the first problem. While the generative models explained above are suitable for reconstructing certain properties of real-world networks, they do not explain much about how or why these properties emerge in the networks they are meant to resemble. Perhaps, for this purpose, it makes more sense to look at networks that grow by justified rules, rather than those that are randomly generated.
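The stub-matching procedure of the basic configuration model described above can be sketched as follows (a hypothetical minimal illustration; as in the basic model, self-loops and multi-edges may occur):

```python
import random

def configuration_model(degrees, seed=None):
    """Randomly pair link 'stubs' according to a given degree sequence.

    `degrees[i]` is the desired degree of node i; the sum must be even
    so that no stub is left unpaired.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("degree sequence must sum to an even number")
    rng = random.Random(seed)
    # One stub per desired link of each node
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Pair consecutive stubs to form the links
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]
```

By construction, every node ends up with exactly its configured degree (counting a self-loop twice), whatever the random pairing turns out to be.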
Barabási (1999) proposed a model based on a simple preferential attachment rule for nodes in a network. In his model (following the summary by Pastor-Satorras et al., 2015, p. 935),


1. We start with a small number of m0 connected nodes;

2. At every time step, we add a node with m links (m ≤ m0);

3. These m links are randomly connected to the ith node with probability k_i / Σ_j k_j, where k_i is the degree of node i, and Σ_j k_j is the sum of all node degrees already

in the network. This leads to a rich-gets-richer dynamic – as nodes with many connections are likely to accumulate even more connections – and to a scale-free network with a degree distribution following eq. 2.1. As many networks – such as the hyperlink network of the world wide web, its backbone structure of routers, or the follow network – exhibit a similar degree distribution, this model is often used for simulations of diffusion dynamics, such as those presented below in sec. 2.5.4. This is despite its lack of clustering. As this is one of the most prominent models from the natural sciences side of network science, it is also appropriate to highlight the ways in which natural scientists’ approaches often differ from those of the social sciences.

The sociological approach assumes that link formation is connected to the characteristics of individuals and their context. Chief examples of the sociological approach include what I will call the big three sociological linkformation hypotheses. These are: shared social foci, triadic closure, and homophily. (Hidalgo, 2016, p. 5)

Natural scientists, on the other hand, often try to stay “agnostic about the characteristics of the individuals involved”. They often “model the evolution of networks as stochastic processes that tie the evolution of a network back to its structure” (Hidalgo, 2016, p. 5), as is the case for Barabási’s model. “For many social scientists, however, preferential attachment would represent an incomplete explanation of link formation since their main interest would be to understand why [emphasis added] people want to connect to hubs”, while natural scientists are mostly interested in universal symmetries and constraints (Hidalgo, 2016, p. 5). Hidalgo (2016) explains how the inquisitiveness of social scientists beyond the network structure can lead to complex, helpful insights: Ethnic segregation, for example, is not only driven by social network structure, but also by the abovementioned


big three: shared foci, homophily, and triadic closure. They “give rise to homogenous self-reinforcing groups” (Hidalgo, 2016, p. 7). Due to the embeddedness of the labour market in the social structure (i.e., the fact that most jobs are somehow brokered via friends and acquaintances), ethnic differences could lead to different job opportunities (Hidalgo, 2016, p. 8). In the same way, it is to be expected that structures in social media networks cannot be explained by network structure alone, but that, to be accurate, models have to be enriched with theory and empirical findings from media and communication studies (for example, to account for the agency of their constituents).
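The preferential attachment rule summarised above can be sketched as follows (a hypothetical, simplified illustration, not a definitive implementation; here the network is seeded with m + 1 fully connected nodes so that m ≤ m0 holds):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow an n-node network by preferential attachment (sketch).

    Starts from m + 1 fully connected seed nodes; every new node
    attaches m links to existing nodes with probability proportional
    to their current degree.
    """
    rng = random.Random(seed)
    m0 = m + 1
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Every node appears in this list once per link it has, so a
    # uniform draw is a degree-proportional draw.
    attachment_pool = [node for edge in edges for node in edge]
    for new_node in range(m0, n):
        targets = set()
        while len(targets) < m:  # m distinct targets per new node
            targets.add(rng.choice(attachment_pool))
        for t in targets:
            edges.append((new_node, t))
            attachment_pool += [new_node, t]
    return edges
```

Drawing uniformly from the list of link endpoints is what produces the rich-gets-richer dynamic: a node with twice the degree appears twice as often in the pool.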

2.5.3 Community Detection

Community detection is a field where the frictional losses in communication between the social sciences or media studies perspective and the natural science perspective are very prominent. This is the main motivation for the comparison of the epistemological implications of two algorithms in the second study of this thesis in chapter 6. As seen in sec. 2.4.2.2, the definition of the term ‘community’ itself is problematic, as it is used for concepts that, apart from their name, do not have much in common. When it comes to community detection, ‘community’ is often simplistically thought of as a part of a network that is more densely connected than expected (Coscia, Giannotti, & Pedreschi, 2011, p. 513). However, although community detection algorithms that try to maximise modularity are apparently the most popular (not least due to their availability in the easy-to-use network analysis software Gephi; Bastian et al., 2009), there are other approaches. Sadly, however, “[m]any approaches in the literature do not explicitly define the communities they want to detect” (Coscia et al., 2011, p. 518). Furthermore, most reviews “cluster the different algorithms according to their operational method, not according to the definition of community they adopt in the first place” (Coscia et al., 2011, p. 513). To shift “the focus from how communities are detected to what kind of communities are we interested to detect” (Coscia et al., 2011, p. 513), Coscia et al. (2011) undertook a particularly instructive review of extant community detection algorithms, classifying them according to their implicit definitions of ‘community’. They categorise community detection algorithms in the following, partly overlapping, classes:


• Structure definition: This is one of the oldest ways to detect communities. After having defined a strict structure as a community, such as groups of a certain number of nodes that are connected by a certain number of edges, the algorithm looks for those patterns. This often allows, for example, for overlapping communities, but is too rigid for most applications.

• Closeness: Algorithms in this class follow a definition of ‘community’ that assumes that actors in the same community can reach each other via shorter paths than actors outside of their community.

• Diffusion: Under the assumption of a contagion process on the network, nodes are considered to be in a community if they are able to influence each other, or end up in the same state after the contagion process.

• Bridge detection: This definition targets isolated communities. Groups of nodes that are easily disconnected by removing bridging nodes or edges are in separate communities.

• Internal density: In the form of modularity maximisation, this is one of the most popular implicit definitions of ‘community’. It assumes nodes to be in the same community if, compared to some null model, there are more edges between them than expected (see sec. 2.5.2).

• Feature distance: This is one of the more abstract definitions; nonetheless, it yields some very interesting algorithms. It relies on a defined distance measure based on features of the nodes (which might be related to the network structure – e.g., to which other nodes they are connected – but also to other properties). Nodes close to each other based on this distance measure (meaning, in most cases, that they have many features in common) are considered to be in the same community.

• Link clustering: This definition differs fundamentally from the others, as it sees links connected to the same entities as communities, while the nodes are then part of the communities of their links.

• Meta clustering: The algorithms in this class do not have a clear definition of ‘community’ but combine several of the above approaches.

These definitions might not seem to lead to very different outcomes for many networks: take, for example, closeness-based algorithms and diffusion-based definitions. Even for these, however, one can easily find simple examples for which they would return different community structures. Coscia et al. (2011) found fundamentally different results for algorithms from different categories that were applied to the network of Facebook friends of one of the authors. Considering the fact that many studies seem to use a community detection algorithm because it is easily available – without stating or considering the implications for the underlying definition and the results – the importance of choosing the right community detection algorithm for the respective application cannot be stressed enough. Furthermore, as will be seen in the second empirical study of this project in chapter 6, the choice of a certain community detection algorithm can have fundamental epistemological implications that affect not only results, but also inferred theory. Therefore, it is important to precisely distinguish between the different abstraction levels of the definitions involved: even though a community detection algorithm might have been developed with a certain application in mind, its code just contains a functional definition of what it does. It defines the meaning of neither its input nor its output. This might be sufficient for the developers of such algorithms. However, as Coscia et al. (2011) show, there is an abstraction layer closer to the application domain, making it possible to define the categories of community detection algorithms mentioned above. This layer can then be mapped to the least abstract definition of community. This definition is the one relevant to sociologists or media and communication scholars – for example, the definition of an issue public, or an echo chamber. The former will be addressed in chapter 5, the latter in chapter 6.
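The ‘internal density’ idea behind modularity maximisation can be made concrete with a small sketch (hypothetical illustrative code, not the implementation used in chapter 6). It computes Newman–Girvan modularity Q: the fraction of links inside communities minus the fraction expected under a configuration-model null model with the same degrees:

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q for an undirected graph.

    `edges` is a list of (i, j) pairs; `communities[node]` is the
    community label of each node.
    """
    m = len(edges)
    degree = {}
    internal = 0  # links with both endpoints in the same community
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
        if communities[i] == communities[j]:
            internal += 1
    # Sum of degrees per community, for the null-model term
    community_degree = {}
    for node, k in degree.items():
        c = communities[node]
        community_degree[c] = community_degree.get(c, 0) + k
    q = internal / m
    q -= sum((d / (2 * m)) ** 2 for d in community_degree.values())
    return q
```

For two triangles joined by a single bridge, splitting the nodes into the two triangles gives Q = 6/7 − 1/2 ≈ 0.36, while lumping every node into one community gives Q = 0 – modularity maximisation would therefore prefer the two-community partition.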

2.5.4 How Things ‘Go Viral’: Contagion on Networks

Community detection algorithms (as discussed above) can help in the investigation of one of the two problems that this thesis focusses on: the emergence of communities – micro-, meso-, and macro-publics – within a networked public sphere. Meanwhile, the second problem – the diffusion of information in such a system – has also been the subject of much research effort in network science. Under a network paradigm, information, behaviours, emotions, or diseases spread among nodes or actors via links, which usually represent some form of interaction among actors. This diffusion, spreading, and/or


contagion (though there are nuanced differences between these terms, they are used more or less interchangeably here) of information and epidemics “interest both natural and social scientists” (Hidalgo, 2016, p. 10). This section provides background knowledge to motivate the analysis and interpretation of the findings of the first study of this thesis (sec. 5), which deals with the contagion of two hashtags, and a link to an online petition. This background also helps in the estimation of the implications of the clustered structure of the Australian Twittersphere, which is analysed in detail in the second study (sec. 6). It does this by providing an overview of the most important models of diffusion on networks (sec. 2.5.4.1); some examples of the empirical analysis of such contagion (sec. 2.5.4.2); and attempts to better predict the spread of items on social media (sec. 2.5.4.3).

2.5.4.1 Modelling of Contagion

The modelling of diffusion or contagion in a population was, until the availability of big social data, mostly done with epidemics of diseases in mind (Pastor-Satorras et al., 2015, p. 967). Classical mathematical models, not necessarily taking network structures into account, generally categorise a population into two to four classes (Pastor-Satorras et al., 2015, p. 928):

• S(usceptible),
• E(xposed), but not yet infectious,
• I(nfected), and spreading the infection,
• R(ecovered) and immune, either permanently or for a given timespan.

Following this classification, the most prominent models have been the SIS, the SIR, the SIRS, and the SEIR(S), in which every member of the population – following certain rates or probabilities, and based on the prevalence of infected individuals – goes from one state to the other in the respective order of the acronyms. These first classical models assumed, as a first approximation, “random and homogenous mixing, where each member in a compartment is treated similarly and indistinguishably” (Pastor-Satorras et al., 2015, p. 930), neglecting any social network structure or individual characteristics. Nevertheless, they were suitable for understanding the basic mechanisms behind the spread of diseases, such as their long-term behaviours, depending on certain rates and probabilities; for example, the “SEIR model is one of the paradigmatic models for the spreading of influenza-like illnesses” (Pastor-Satorras et al., 2015, p. 929). Of particular note (and this is what made these models especially interesting for statistical physicists) was the existence of phase transitions – originally a concept for the description of matter changing its state, e.g., a fluid becoming a gas – at critical values of control parameters (such as the infection probability or the recovery time), leading to an “abrupt change in the state (phase) of a system, characterized by qualitatively different properties” (Pastor-Satorras et al., 2015, p. 931); for example, from a phase where a disease stays locally confined, to a phase of sudden extinction of the entire population. These models have been the focus of much research across several disciplines. An extensive review of the developments in this area, particularly taking the lack of network effects and other limitations of the classical models into account, is given by Pastor-Satorras et al. (2015). These classical models, however, cannot be applied one-to-one to social contagion, although it might be tempting to do so without further thought:

Some specific features of social contagion, however, are qualitatively different from pathogen spreading: the transmission of information involves intentional acts by the sender and the receiver, it is often beneficial for both participants (as opposed to disease spreading), and it is influenced by psychological and cognitive factors. (Pastor-Satorras et al., 2015, p. 968)

Empirical analysis shows a more complicated picture of contagion than a germ with a more or less defined contagiousness (which is complicated enough). Therefore, models have to be adapted for social phenomena. Two possible adaptations are complex contagion models, also called ‘threshold models’ (Pastor-Satorras et al., 2015, p.
970), and ‘rumour spreading models’ (Pastor-Satorras et al., 2015, p. 970). Complex Contagion or Threshold Models

To understand more about how social

contagion does not only depend on the actors in a network, but also on the network structure – and, therefore, making network science so important – I now introduce some 23

originally a concept for the description of matter changing its state; e.g., a fluid becoming a gas

72

CHAPTER 2. LITERATURE REVIEW

theoretical thoughts about simple and complex contagion. A contagion is called ‘complex’ if the spread of an item to a new node needs this node to be connected to more than one already “infected” node24 . Explaining it the other way around, a simple contagion spreads through single connections; the more connections are needed, the more complex the contagion is. According to Centola & Macy (2007, pp. 707–708), this complexity can be influenced by • Strategic complementarity (e.g., competitors in a market waiting for others to try innovations first); • Credibility (e.g., journalists spreading a fact only if they have more than one independent source); • Legitimacy (e.g., concerns about sharing critical content regarding law and order; the government as a reason for complex contagion in political protest; or the contradiction of social norms); • Emotional contagion (e.g., emotions have to infect a certain threshold of people to cause a mass panic). A disease, for example, spreads simply through one contact with an infected person, while a behavioural modification to protect from the disease (such as avoiding handshakes or using condoms) spreads complexly through the influence of more than one peer (in most cases, because the behavioural modification was against social norms)(Centola & Macy, 2007, p. 730). Taking this phenomenon into account changes much of what has been assumed about the strength of weak ties: According to Granovetter (1973, p. 1366), “whatever is to be diffused can reach a larger number of people, and traverse a greater social distance, when passed through weak ties rather than strong”. This insight has become one of the most widely cited and influential contributions of sociology to the advancement of knowledge across many disciplines, from epidemiology to computer science. However, considering the concept of complex contagion shows “the need to circumscribe carefully the scope of Granovetter’s claim” (Centola & Macy, 2007, p. 703). 
Long weak ties can become literally weak in cases of complex contagions as they are not ‘broad’ enough. 24

Sometimes contagion involving mechanisms beyond the ‘simple’ mechanics of the spread of illnesses, especially social contagion, is also referred to as ‘complex’
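To make the distinction concrete, the difference between simple and complex contagion can be sketched in a deterministic toy simulation. The ring-lattice construction, the threshold of two, and all function names below are illustrative assumptions, not the exact setup used by Centola & Macy (2007):

```python
# Toy comparison of simple vs. complex (threshold) contagion.
# Two ring-lattice 'communities' are joined by a single long tie; a simple
# contagion crosses this narrow bridge, a complex one does not.

def ring_lattice(n, k=2):
    """Each node links to its k nearest neighbours on either side."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0}
            for i in range(n)}

def spread(adj, seeds, threshold):
    """Deterministically activate nodes with >= threshold active neighbours."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbours in adj.items():
            if node not in active and len(neighbours & active) >= threshold:
                active.add(node)
                changed = True
    return active

n = 30
left = ring_lattice(n)
right = {u + n: {v + n for v in vs} for u, vs in ring_lattice(n).items()}
adj = {**left, **right}
adj[0].add(n)    # one long, 'narrow' tie between the communities
adj[n].add(0)

simple = spread(adj, seeds={0, 1}, threshold=1)    # simple contagion
complex_ = spread(adj, seeds={0, 1}, threshold=2)  # complex contagion

print(len(simple), len(complex_))   # the bridge carries only the simple one
```

The contagion that needs confirmation from two peers saturates its own community but never crosses the single-tie bridge – a minimal illustration of why ‘wide’ bridges matter for complex contagion.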


In the case of complex contagion, “bridges”, made of the necessary number of ties, are needed for an item to spread (Centola & Macy, 2007, p. 710). On Twitter, complex contagion is not just a theoretical construct: it has already been observed in the wild in the spread of hashtags, especially political ones (Romero, Meeder, & Kleinberg, 2011). Indeed, complex contagion can provide a basis for predicting the virality of content on Twitter. First, however, we need to understand another property of complex contagions: their propensity to lead to a critical mass phenomenon, a phase transition. Barash, Cameron, & Macy (2012) showed, analytically and computationally, the existence of a critical mass, a bifurcation point of infected nodes, for complex contagion. If a contagion reaches this critical mass, it will spread through their entire model networks. They tested this on a rewired lattice, as well as on generated random power-law networks (see sec. 2.5.2) which, to a certain extent, resemble the structural properties of online social networks. Their model networks are still highly artificial, however, as they themselves admit: “much more research is needed before we can have confidence in the predicted existence of a bifurcation point in the propagation of complex contagions” (Barash et al., 2012, p. 461). They also state, however, that the “existence of a bifurcation point in the propagation of complex contagions has a potentially valuable practical implication for the ability to predict the eventual outcome at the early stages of a viral marketing campaign” (Barash et al., 2012, p. 460). If a complex contagion manages to infect a certain number of clusters, this means that the contagion is ‘simple’ enough to spread to other clusters as well; that the bridges in this network are wide enough; or that it has reached the critical mass, so that even a non-infected cluster has enough infected neighbours from other clusters. “Simply put, once a complex contagion reaches critical mass, it begins to spread in the same way as a simple contagion—taking advantage of shortcuts to distant regions and eventually reaching every node in a connected network” (Barash et al., 2012, p. 460).

One possibility to further extend threshold models is to allow for repeated activation. While threshold models are already closer to the perceived reality of many social contagion phenomena than are simple epidemic models, most threshold models still “assume that activation happens only once …. This is the reason why threshold models are appropriate to capture the first stage of coordination dynamics” (Piedrahita, Borge-Holthoefer, Moreno, & González-Bailón, 2017, p. 2). However, a more realistic assumption is that, after this first phase, repeated activation is relevant to, for example, keeping a hashtag regarding some social movement alive. Then “failure to trigger a chain reaction depends not only on the distribution of thresholds or the impact of network structure on activation dynamics; it also depends on whether the network facilitates coordination, that is, an alignment of actions in time” (Piedrahita et al., 2017, p. 2).

Rumour Spreading Models

Rumour spreading models are based on the SIR model. However, recovery does not occur through some node-internal process, but through interaction with other nodes: “If the spreader finds that the recipient already knows the rumor, he or she might lose interest in spreading it any further” (Pastor-Satorras et al., 2015, p. 970). Susceptible actors are interpreted as ignorants, infected actors are spreaders, and recovered individuals are called stiflers. The ratio of stiflers to the total number of nodes in the long-term outcome of this model indicates whether a rumour stayed localised or went global (Pastor-Satorras et al., 2015, p. 970). Interestingly, on scale-free networks (i.e., resembling the model by Barabási (1999); see sec. 2.5.2), it has been shown that the heterogeneity of the degree distribution hinders the spread of rumours: “large hubs are rapidly reached by the rumor, but then they easily turn into stiflers, thus preventing the further spreading of the rumor to their many other neighbors” (Pastor-Satorras et al., 2015, p. 970).

Modelling Influence

Limitations of the rumour spreading model in describing real rumour spreading have been highlighted by the fact that, while simulations suggest no heightened impact of nodes being privileged by their position in the network (as measured by their k-coreness), empirical evidence based on Twitter data speaks against this (Pastor-Satorras et al., 2015, p. 971). Hence, the model has been extended by Borge-Holthoefer et al. (2013b) in two ways: it is either assumed that recipients are not always active, or a high probability of their turning directly into stiflers is introduced. Both these extended models deliver results that qualitatively match the evidence on Twitter (Pastor-Satorras et al., 2015, p. 971). However, the hypothesis that “influentials – a minority of individuals who influence an exceptional number of their peers – are important for the formation of public opinion” (Watts & Dodds, 2007, abstract), which is closely related to the two-step flow hypothesis (see sec. 2.4.2.4), has not remained uncontested. Watts & Dodds (2007), simulating a simple SIR and a threshold model of contagion on a variety of artificial networks (i.e., with high and low variance of the random degree distribution, with and without community structures), found:

First, “ordinary” influentials of the kind considered in low-variance networks appear to be important as initiators of large cascades where the threshold rule is in effect and in conditions where these cascades are only marginally possible …. They do not, however, play important roles as initiators under most conditions of the threshold model, under any conditions as early adopters, or when the SIR model is in effect. Second, “hyperinfluentials” of the kind that arise in high-variance networks are important as initiators under a wider range of conditions than ordinary influentials but still only when the threshold rule is in effect. Third, when the SIR rule is in effect, hyperinfluentials play an important role as early adopters, when networks are sufficiently sparse, but not as initiators. Finally, group structure appears to generally impede the effectiveness of influentials both as initiators and early adopters. (Watts & Dodds, 2007, p. 453)

Therefore, they conclude:

Under most conditions, we would argue, cascades do not succeed because of a few highly influential individuals influencing everyone else but rather on account of a critical mass of easily influenced individuals influencing other easy-to-influence people. In our models, influentials have a greater than average chance of triggering this critical mass, when it exists, but only modestly greater, and usually not even proportional to the number of people they influence directly. (Watts & Dodds, 2007, p. 446)

Nevertheless, the findings by Watts & Dodds (2007) are based on models, and they see their results simply as a hint at the need for empirical research. Indeed, their conclusion should be reinterpreted in the light of online social media: The conditions under which they found a crucial role of influencers, while from their perspective unrealistic, are actually the conditions found in social media networks. Most communication networks on the web exhibit a high-variance structure; therefore, hubs or “hyperinfluentials”, which they saw (in 2007) as “more a theoretical possibility than an empirical reality” (Watts & Dodds, 2007, p. 454), are rather sparse (see Ugander et al., 2011 for Facebook; see Myers et al., 2014 for Twitter), and do not necessarily show a threshold behaviour. In any case:

It is important to note that contradicting simulation results do not cancel out at the theoretical level. Incompatible outcomes (i.e. that influencers exist, or that they do not) simply highlight the fact that all models recover real phenomena only partially. (Borge-Holthoefer et al., 2013a, p. 11)

Therefore, the question of whether “there is a small subset of special individuals who, given their centrality in the network, can influence a disproportionate number of others; or influence accumulates through the smaller networks of a critical mass of less central people” (Borge-Holthoefer et al., 2013a, p. 18) – as, for example, Barash et al. (2012)’s findings about complex contagion suggest (see sec. 2.5.4.1) – cannot be answered with models alone. The answer will likely differ by communication platform, content, or context, and needs to be answered empirically with real-world data. However, while much research on contagion and virality has been done in the form of artificial models and simulations, empirical research is often cumbersome – due, for example, to problems with data access, noisy data, or arcane platform mechanisms. This also explains why studies such as the first study of this thesis, in chapter 5, which try to connect these concepts with real-world data, are rarely found.
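The ignorant–spreader–stifler dynamic discussed above can be sketched in a few lines. The following is a minimal, Maki–Thompson-style toy model; the complete-graph example, the deterministic stifling rule, and all names are illustrative assumptions, not the exact model analysed by Pastor-Satorras et al.:

```python
# Minimal ignorant/spreader/stifler rumour dynamic: a spreader who contacts
# someone already informed loses interest and becomes a stifler.
import random

def rumour_model(adj, seed_node, rng):
    state = {v: "ignorant" for v in adj}
    state[seed_node] = "spreader"
    while any(s == "spreader" for s in state.values()):
        for u in [v for v, s in state.items() if s == "spreader"]:
            contact = rng.choice(adj[u])
            if state[contact] == "ignorant":
                state[contact] = "spreader"   # the rumour is passed on
            else:
                state[u] = "stifler"          # u loses interest
    return state

# Complete graph on 20 nodes as a toy example.
n = 20
adj = {i: [j for j in range(n) if j != i] for i in range(n)}
final = rumour_model(adj, seed_node=0, rng=random.Random(42))

reached = sum(1 for s in final.values() if s == "stifler") / n
print(reached)   # ratio of stiflers: localised vs. global spread
```

The final ratio of stiflers to nodes is exactly the quantity the literature uses to distinguish a rumour that stayed local from one that went global.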

2.5.4.2 Empirical Analysis of Contagion in Online Media Environments

Due to the availability of social media data, the analysis of the contagion of information, emotions, and opinions is a quickly growing field in a number of disciplines. To present all available studies here would, again, be beyond the scope of this thesis, hardly manageable, and not particularly useful. Some examples – dealing not only with contagion in isolation, but with contagion in conjunction with other network phenomena and within a broader context, and (especially) connecting back to media and communication theory – have already been discussed in sec. 2.2.

However, with a focus on network science alone, it is useful to consider that even though network analysis is a well-established field and provides well-established network measures, there can still be reasons to reinterpret old measures from other fields. This was done by Goel, Anderson, Hofman, & Watts (2015), who proposed the Wiener Index – a measure used since the late 1940s in mathematical chemistry – to measure what they call the structural virality of an information diffusion cascade (i.e., of the network that emerges if we draw connections between every source of some information and the adopters of this information from that source). In fact, the Wiener Index is simply the average shortest path length between all nodes. But this measure of structural virality seems able to distinguish sheer popularity from a truly viral phenomenon – a term whose definition needs some work, as we see in the study regarding contagion in chapter 6 – that is driven by word-of-mouth mechanisms. Using their structural network descriptor with a representative set of cascades from Twitter, Goel et al. (2015) give evidence that pure popularity (i.e., the number of shares) is not to be taken as equivalent to this notion of virality. We test this measure and further consider its suitability for assessing the virality of diffusion cascades in chapter 5.

It is also possible to find influences of real-world events on the diffusion of digital talk on Twitter. An examination by Val, Rebollo, & Botti (2015) of the evolution over time of standard Twitter metrics – such as the number of retweets or @replies – and of basic network measures – such as the diameter or the clustering coefficient – found significant differences in the temporal evolution of user behaviour and network structure among four categories of events: TV shows, socio-political events, keynotes, and conferences. This shows that even non-adapted basic network metrics can give insights into the qualitative nature of the content being spread. The results in chapter 5 confirm this.
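Since the Wiener Index is just the mean shortest-path distance over all node pairs, structural virality is straightforward to compute for a small cascade. The star and chain below are fabricated toy cascades, and the BFS helper is a plain-Python stand-in for what a graph library would provide:

```python
# Structural virality (Wiener Index) as the mean shortest-path length over
# all node pairs of a diffusion tree, following Goel et al.'s definition.
from collections import deque
from itertools import combinations

def distances_from(adj, src):
    """BFS distances from src in an undirected adjacency dict."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def structural_virality(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    pairs = list(combinations(adj, 2))
    return sum(distances_from(adj, u)[v] for u, v in pairs) / len(pairs)

# A broadcast: one seed reshared directly by six followers (a star).
star = [(0, i) for i in range(1, 7)]
# Word-of-mouth: the same seven adoptions passed along a chain.
chain = [(i, i + 1) for i in range(6)]

print(round(structural_virality(star), 2))   # → 1.71: popular, not viral
print(round(structural_virality(chain), 2))  # → 2.67: structurally viral
```

The two cascades have identical size, yet the chain scores markedly higher, which is exactly how the measure separates sheer popularity from word-of-mouth diffusion.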


2.5.4.3 Prediction of Contagion in Online Media Environments

One test of our understanding of the diffusion of content online through a network – and one of the main motivations behind much diffusion research, as undertaken in this project and described in chapter 5 – is the attempt to predict the popularity or virality of an item. This section summarises some selected examples which show that it is possible to improve predictions, especially when the focus is on network structures. Inspired by the possibility of exploiting the phase transition predicted by Barash et al. (2012) (see sec. 2.5.4.1), Weng, Menczer, & Ahn (2013) looked at the spread of hashtags on Twitter, and succeeded in improving the prediction of their popularity. By taking the number of infected clusters and concentration measures of the clusters into account, they were better able to train a machine-learning algorithm to predict the popularity of a hashtag.

Some research claims, however, that the size of information cascades is inherently unpredictable (Cheng, Adamic, Dow, Kleinberg, & Leskovec, 2014). Nevertheless, analysing sharing cascades of pictures on Facebook, Cheng et al. (2014) suggest that prediction becomes easier if it is reframed as a continuous tracking task while the cascade grows.

Rather than thinking of a cascade as something whose final endpoint should be predicted from its initial conditions, we think of it as something that should be tracked over time, via a sequence of prediction problems in which we are constantly seeking to estimate the cascade’s next stage from its current one. (Cheng et al., 2014, p. 925)

Therefore, Cheng et al. (2014) analysed the cascade’s features while it grew, not only the content that was shared. They reframed the question ‘Will this content spread to a certain number of recipients?’ as ‘Will this cascade, at a certain size, double its size?’ By doing so, they constructed a surprisingly robust prediction algorithm, the predictive quality of which increased with the number of reshares that had been observed so far. At the same time, they confirmed that the content alone could not predict cascade size; with data about the initial shares, however – such as the cascade structures of the first three reshares – prediction became easier. Looking at a vast set of other predictors (temporal predictors, graph structure, properties of the accounts and pages sharing, properties of the content), including the structural virality measure explained above, they could predict whether an ongoing cascade of an image on Facebook would double in size, with a surprisingly strong and robust performance of their prediction algorithm (based on machine-learning classifiers). This shows that cascade dynamics have predictable properties. However, the machine-learning approach lacks a theoretical model and, therefore, does not deliver explanatory insights. At the same time, even though the researchers had access to the data they used from Facebook itself, Facebook remains a black box, tweaking the virality of cascades with its proprietary algorithms. Nevertheless, their construction of the prediction as an ongoing tracking task is a concept worth following, and shows that network structure plays an important part in this prediction effort.
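The reframing at the heart of this approach can be illustrated with fabricated toy data: instead of regressing on a cascade's final size, each cascade observed at size k becomes a binary classification instance ('will it at least double?'). The cascade sizes and the helper name below are invented for illustration:

```python
# Reframing cascade prediction as a tracking task (after Cheng et al., 2014):
# at observed size k, the label is whether the cascade at least doubles.

def will_double(final_size, k):
    """Binary label for the tracking task at observed size k."""
    return final_size >= 2 * k

final_sizes = [5, 12, 80, 200, 7]   # fabricated final cascade sizes
k = 5                               # we have observed the first 5 reshares

# Only cascades that actually reached size k yield a training instance.
labels = [will_double(s, k) for s in final_sizes if s >= k]
print(labels)   # → [False, True, True, True, False]
```

Features extracted from the observed prefix (e.g., the depth and breadth of the first reshares, or inter-reshare times) would then feed a classifier; as noted above, predictive quality grows as k grows.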

2.5.5 Some Steps Further: Multilayer Networks

Until now, we have considered only a very simple kind of network: the same kind of nodes connected by the same kind of edges in one structure that we call ‘the network’. Real networks, however, do not look like that. Humans are associated with more than one account, most likely on more than one platform besides Twitter. With these accounts, they post mentions of other accounts. These posts contain links to websites, maybe a hashtag. This hashtag potentially connects these posts with other posts. The websites themselves can be networks again, as they contain links to other websites. And so on. This is where multilayer networks come in. Because single-layer networks have already been complicated enough to raise questions for more than half a century, multilayer networks are the subject of a quite recent, but quickly growing, body of studies. All methods established on single-layer networks have to be reviewed and adapted, and completely new methods appear to be necessary. This has led to an “exploding body of recent work in this area” (Kivelä et al., 2014, p. 252): “Multilayer structures induce new degrees of freedom […] but they remain poorly understood” (Kivelä et al., 2014, p. 252). However, recent and ongoing research (especially in mathematics, physics, engineering, and computer science) – on the structure and dynamics of multilayer networks (Boccaletti et al., 2014), on spreading processes on them (Salehi et al., 2014), and on their formation (Magnani & Rossi, 2013) – will hopefully also inform the social sciences and (especially) media research. Already, multilayer, or multiplex, networks are a topic in both ‘streams’ of network science, and bring a qualitative dimension to the natural sciences. However, “this has not brought the study of networks by natural scientists closer to the literature advanced in the social sciences, since the focus has been primarily on the generalization of network measures … and on the mathematical implications for robustness and fragility” (Hidalgo, 2016, p. 4). If anything, it has increased the fragmentation among disciplines:

as it often happens when research areas are extremely heterogeneous and are being developed autonomously by scholars as different as physicists, computer scientists, and sociologists, it is not easy to identify a single clear direction for future research, and even trying to provide a frame to what exists poses serious challenges. (Dickison et al., 2016, p. 169)

However, the importance of the concept of multilayer networks for media studies is obvious. Modelling communication as a multilayer network depicts reality more appropriately and, therefore, when working empirically, makes the lack of data about the actual pathways that contagion takes even more obvious (Dickison et al., 2016, p. 150). For example, due to additional layers and, therefore, more links, the speed of contagion is most likely increased (Dickison et al., 2016, p. 159). In summary, while not addressed in this study, the concept of multilayer networks is clearly a direction to follow in building a better understanding of our networked media environment.
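The basic idea can be made concrete with a toy two-layer network over the same set of accounts; the layer names and edges below are illustrative assumptions. Flattening (aggregating) the layers shows how an extra layer opens diffusion paths that no single layer contains:

```python
# A toy multiplex network: the same four accounts on two directed layers.
from collections import deque

layers = {
    "follows":  {"a": {"b"}, "b": {"c"}, "c": set(), "d": set()},
    "hashtags": {"a": set(), "b": set(), "c": {"d"}, "d": set()},
}

def aggregate(layers):
    """One simple flattening: the union of each layer's edges."""
    agg = {}
    for adj in layers.values():
        for u, vs in adj.items():
            agg.setdefault(u, set()).update(vs)
    return agg

def reachable(adj, src):
    """Nodes reachable from src by directed BFS."""
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u] - seen:
            seen.add(v)
            queue.append(v)
    return seen

# 'd' cannot be reached from 'a' on the follow layer alone ...
print(reachable(layers["follows"], "a"))
# ... but it can once the hashtag layer is added.
print(reachable(aggregate(layers), "a"))
```

This is the simplest possible intuition for why additional layers tend to speed up contagion: every layer contributes links, and some paths exist only in the combination.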

2.6 Conclusion

Fake news, filter bubbles, and echo chambers (as described in sec. 2.1.1 and sec. 2.1.2) are the most prominent examples of phenomena that seem to be significantly affected by the emergence of a heavily networked public sphere. While these constructs have gained much attention during recent years, they actually stand for two categories of problems which, from a network perspective, are crucial to understanding mass communication in the current networked public sphere: problems of mostly short-term, ephemeral diffusion dynamics; and questions regarding the mostly long-term, slowly changing network structures that enable these dynamics (e.g., follow or friendship networks on social media platforms).

For both classes of problems, social media data is a rich source of empirical evidence that is subject to approaches by researchers from a broad range of disciplines (see sec. 2.2). This research, however, is hindered by a number of challenges (see sec. 2.3). Rooted in their nature of ‘organised complexity’, the relevant questions impose technical difficulties on researchers who are often not equipped to handle them; on the other hand, those who are able to overcome the technical challenges often lack the theoretical background and the domain knowledge to approach and interpret these problems and their respective findings in a useful way: that is, a way that either furthers our understanding of a networked public sphere, or leads to practical advice for media practitioners or policy makers.

One of the most fundamental challenges to further developing an empirically driven and practically useful theory of a networked public sphere, therefore, is to overcome an existing disciplinary divide between natural, mathematical, and computational sciences on the one hand, and social sciences and humanities on the other. This challenge needs a common language, for which concepts from network science and programming are promising candidates. It also necessitates recognition of the epistemological differences among the disciplines involved, and finding approaches to bridge these differences.

First, however, a basic mutual understanding of the relevant theories from mass communication theory on the one hand, and network science concepts on the other, is needed. This understanding is within the scope of this thesis and has been explored above. While public sphere theory is already a rich repository of hypotheses and models, network science can provide even more ideas and (especially) methods to test and extend these.
Some researchers work in, and are equally interested in, both public sphere theory and network science, and find an intellectual home in newly emerging disciplines such as the computational social sciences. Nevertheless, theoretical and methodological work from both fields regarding the problems relevant to understanding the networked public sphere today is still rarely published together at the same level of detail and expertise. This thesis gives equal weight to both. It provides ways to investigate problems of public communication – particularly, ephemeral diffusion dynamics and long-term network structures on a national scale – with network science methods. At the same time, this investigation is motivated by, and interpreted with, extant mass communication theory from media and communication studies.

From a media and communication studies perspective (sec. 2.4), the public sphere has already been conceptualised as rooted in networks. The question remains whether it develops into an expanding universe with galaxies isolated by the void between frontiers of a polarised public opinion, or into a system of metropoles connected by broad trading routes that link marketplaces of information, ideas, and opinions (sec. 2.4.1). As stated above, this can be formulated as a question regarding long-term network structures and short-term information diffusion. Parts of the answer might be sourced from existing theories and research traditions that have already implicitly talked about network structures, or can be interpreted as having done so (sec. 2.4.2). Diffusion research, for example, has already provided promising means and data that can surface the networked nature of news diffusion. In the case of audiences, publics, and communities, new opportunities similarly make the respective networks of connections and their communication visible and measurable. Research on community detection plays a major role in this field; however, it needs to be used in conjunction with theory about what actually constitutes a community, public, or audience. At the same time, theorists would do well to engage with the technicalities of community detection algorithms to gather evidence to support their theoretical constructs. Gatekeeping as a concept is questioned because produsers are able to bypass the former gatekeepers. Meanwhile, gatekeeping processes are network processes and, therefore, operationalisable with network science methods. The same holds true for the two-step flow hypothesis.
The concept becomes more complicated due to increasing degrees of freedom in media mechanics; at the same time, however, we can translate it into the language of a network paradigm and give measurable, quantified evidence for or against it.

The network science perspective models the public sphere(s), theorised in media and communication studies, as a complex network. Evolved as an answer to the problems of organised complexity, network science, broadly speaking, gave birth to two streams, in the natural sciences and the social sciences. However, these do not join up often enough (sec. 2.5.1), due to diverging goals and organisational boundaries among the related academic disciplines. Nevertheless, network science offers a Swiss army knife of methods, models, and theories for the empirical and theoretical examination of public communication today. Network models provide a baseline for comparison with empirical findings, and help to explain the links between micro-scale behaviours, meso-scale structures, and macro-scale phenomena.

Community detection is a waterhole which both species, social and natural scientists, visit; however, it seems as if they rarely meet eye-to-eye. This leads to problems with the definition of ‘community’ which, in turn, lead to a lack of clarity about what community detection algorithms should detect and whether they actually do detect communities. This definition problem is an example of the entanglement of theory and method in network science methods. These methods provide not only analytical tools, but implicitly contain and embody theoretical constructs: for example, in the functional definition of a community detection algorithm, of a centrality measure, of a network growth model, or of a model for information diffusion and contagion.

Illustrating this, contagion was first mainly examined to describe epidemics of illnesses in the natural sciences. However, social contagion shows features more complex than the flu. This complexity has led to more suitable threshold and rumour-spreading models that integrate human motivations for social interaction; however, there is still room for improving these models. Social contagion is necessary for influence, and this closes the circle back to theories about opinion leadership and two-step flows – on which most of the empirical, data-driven analysis that tries to engage with theory still seems to focus. The attempts to predict contagion, on the other hand, show the difficulties in understanding it: difficulties caused by its inherently chaotic nature. Nevertheless, there are some promising approaches to overcoming these difficulties.
Finally, the concept of multilayer networks makes more realistic models possible. It extends the space of possible methods that might be suitable for finding evidence to test or extend theory. Nevertheless, this concept complicates matters further, not least because, again, it has developed in disconnected disciplinary advances. “It remains true, however, that science is an almost overwhelming illustration of the effectiveness of a well-defined and accepted language, a common set of ideas, a common tradition” (Weaver, 1948, p. 543). Therefore, “we need to help translate the research goals and intentions of one group of researchers to the language of the other (or at least, into a simple language that everyone can understand)” (Hidalgo, 2016, p. 3). This project is designed to take another step towards this mutual understanding.

Chapter 3

Objectives

As already apparent from the literature review, large-scale data about public communication can enable researchers to back theory about the dynamics and structures of national or even global public spheres with empirical evidence. At the same time, these data can lead to advances not only in theory, but also in practice (e.g., in providing new metrics for media production and consumption); they can guide public policy (e.g., in providing evidence for or against the existence and effects of filter bubbles, echo chambers, and fake news in public discourse); and they can provide strategies to address these kinds of problems. There is also especially great potential in translating existing theories and hypotheses into the language of network science to understand and further theorise the organised complexity found in public communication settings. Finally, network science concepts and measures can be used as a source of new ideas and evidence for mass communication theory.

All this potential is difficult to unlock, however. It involves two main fields, network science and media and communication studies – fields that are themselves divided into disciplinary subfields and research cultures, with differing educations, mindsets, incentives, and goals. Often, they are further isolated by institutional and organisational boundaries. More practically, there are technical difficulties in analysing big data – in this case, complete datasets about communication on a national scale. Many network analysis approaches require computing resources that grow exponentially with the size of the dataset. As this kind of analysis is often unprecedented, the process of analysis itself becomes the subject of experiment – and these experiments can fail.

Due to these disciplinary boundaries and technical difficulties, the time and effort needed to tap into the possibility of combining network science and media and communication theory appears daunting. Therefore, research is needed to address the methodological barriers; to build a deeper understanding of the implications of applying these methods within the context of public sphere theory; and to undertake the needed experimentation with these methods themselves. Consequently, the main goal of this project is to explore network science methods while bearing in mind the mass communication theory that is relevant to a networked public sphere. The insights gained will be presented in a way that enables a researcher from the field of media and communication studies to start analysing questions about the structures and diffusion of information within the public sphere from a network science perspective. At the same time, a researcher from network science will gain an entry point to the understanding and further exploration of relevant media and communication theories. Theories from this field that are especially likely to profit from network science approaches are theories about a public sphere rooted in networks, about audience fragmentation, and about a networked public sphere made up of overlapping, hierarchical sub-structures, as described in sec. 2.4.1; and theories about relational structures in the public sphere: for example, regarding the diffusion of news and information (see sec. 2.4.2.1); (networked) audiences, such as issue publics (see sec. 2.4.2.2); theories building on the concepts of gatekeeping and gatewatching (see sec. 2.4.2.3); or, building on an equally old tradition, theories rooted in the concepts of one-, two- and multi-step flows (see sec. 2.4.2.4).
Thus, rather than addressing a single research question in the field of media and communication studies, the project is focused primarily on gaining methodological insight through method exploration and testing, and on the creation of new and useful methodological procedures. Empirical results that test established theories are, however, also an outcome of the project, as both are necessary to validate the utility of the methods that are explored and developed. With this in mind, the (intentionally) broad questions that guide and frame the project are as follows:

1. How can we interpret established social science and media and communication concepts and theories of public communication as theories of the structural dynamics of networks? This question has already been partly addressed by the examples of this form of interpretation given in the literature review, especially in sec. 2.4. It is further revisited in both studies in chapter 5 and chapter 6, and reflected upon in the discussion in chapter 7.

2. How can we apply network science methods to analyse the structure and the dynamics of networks in online public communication at a scale of whole societies, and from a perspective that builds upon, rejects, or extends traditional media and communication theory? This question is mainly investigated through the studies in the empirical part of the thesis (chapters 5 and 6).

3. Can this analysis be validated by its utility in confirming established theories, or by the development of new theories about online public communication and the public sphere? This is the main focus of an overarching discussion (sec. 7) that also elaborates on research questions 1 and 2.

In following these questions as guiding lights towards its objectives, this project takes further steps into the exciting field that these questions open up. In so doing, it aims to help researchers from as many related disciplines as possible.

Chapter 4

Methodology and Research Design

Methoden ermöglichen es der wissenschaftlichen Forschung, sich selbst zu überraschen. (Luhmann, 1997, p. 37)

Methods allow scientific research to surprise itself.

4.1 Methodology

The discourse between quantitative and qualitative researchers, and between followers of different epistemological ‘isms’, has a long history and prominent standing in social science research. For a natural scientist, however, the methodological debates in social science are often surprising. Before beginning to understand the subtle but significant difficulties in connecting network science with communication and media studies theory, it is important to be aware of this conflict within the social sciences themselves. This section summarises relevant parts of this discourse; shows how it is slowly being steered in a new direction by the computational turn and the use of big social data; addresses the importance and dangers of pattern recognition in the creation of knowledge; and, finally, argues for pragmatism and methodological pluralism as necessary to the investigation and building of a theory of public communication on a society-wide scale with network science methods. Together, these discussions provide a schematic, but important, methodological (i.e., epistemological and teleological) base for the research design of this project in sec. 4.2.


4.1.1 Conflicting Epistemologies, Conflicting Methods?

Not only are the disciplinary boundaries of media and communication studies blurry, they are also characterised by differing research cultures within the field. Josephi (2017, p. 476), from a German perspective, notes that “in [Australian] communication studies cultural studies have played a major role. In terms of research methodology, Australia long eschewed large quantitative studies, possibly because of its size, nor was it overly influenced by Britain’s Marxist theoretical approach”. Cultural studies and more quantitative approaches to mass media studies seem to be positioned on two sides of a historical, epistemological, and methodological divide that goes beyond media and communication studies.

This divide in social and behavioural science “throughout the 20th century” (Onwuegbuzie & Leech, 2005, p. 375) can be traced back to differing epistemologies – or ‘worldviews’, as Creswell (2014) calls them – most prominently, post-positivism and constructivism. Post-positivists follow a “deterministic philosophy in which causes (probably) determine effects or outcomes”. Therefore, “developing numeric measures of observations and studying the behavior of individuals becomes paramount for a postpositivist” (Creswell, 2014, “The Postpositivist Worldview”, para. 2), based on the assumption that researchers can approximate an objectively measurable truth about reality. Constructivists, on the other hand, try to “make sense of (or interpret) the meanings others have about the world” and “inductively develop a theory or pattern of meaning” (Creswell, 2014, “The Constructivist Worldview”, para. 1). In this respect, constructivism is closely related to interpretivism, both seen as being opposed to (post-)positivism.

In the social sciences, differing epistemologies sometimes lead to the thesis “that quantitative and qualitative research paradigms and methodologies cannot and must not be mixed”.
This view stems from a “tendency among researchers to treat epistemology and method as being synonymous” (Onwuegbuzie & Leech, 2005, p. 376). Meanwhile, “[i]n other sciences (‘real’ sciences) methodological pluralism seems to be taken for granted and goes unremarked” (Sechrest & Sidani, 1995, p. 77).¹ “So far as we can tell, it is only in the social sciences that any quantitative vs. qualitative controversy rages” (Sechrest & Sidani, 1995, p. 77).

The emotionality that made this controversy ‘rage’ stems from barriers to understanding on both sides and, quite possibly, an overly passionate search for whatever each side considers to be the truth. On the positivist side², researchers often “disregard the fact that many research decisions are made throughout the research process that precede objective verification decisions” (Onwuegbuzie & Leech, 2005, p. 377). Furthermore, social science research objects cannot be measured with the same reliability as in the natural sciences. They are almost always social or philosophical constructs (Onwuegbuzie & Leech, 2005, p. 377), deriving their meaning from interpretation and evading clear definitions. This is often reflected in low reliability scores and a lack of statistical power in quantitative social science research; in other words, replicability often becomes a problem. Therefore, “the techniques used by positivists are no more inherently scientific than are the procedures utilized by interpretivists” (Onwuegbuzie & Leech, 2005, p. 378).

On the interpretivist or constructivist side of the argument, the “claim that multiple, contradictory, but valid accounts of the same phenomenon always exist” misleads researchers into adopting an “‘anything goes’ relativist attitude”; at times, it even neglects the possibility of clear definitions in the realm of the natural sciences. The resulting lack of a replicable rationale for interpretations promotes a lack of comparable standards and generalisation where these might actually be possible and needed. It also, therefore, leads to a lack of trust in research in general (Onwuegbuzie & Leech, 2005, p. 378).

Onwuegbuzie & Leech (2005, p. 376) see this controversy taking place in three camps – purist, situationalist, and pragmatist – “with purists and pragmatists lying on opposite ends, and situationalists lying somewhere between purists and pragmatists”.

¹ I want to note, at this point, that I regard the division of academic research into ‘real’, ‘hard’, or ‘soft’ categories as neither instructive nor helpful.
Purists see both epistemologies as incompatible and argue, therefore, that “quantitative and qualitative approaches cannot and should not be mixed” (Onwuegbuzie & Leech, 2005, p. 376), while seeing one of the two approaches as superior to the other. Situationalists see value in both, depending on the research question, but still argue for a mono-method approach. Pragmatists, however, decouple epistemology from method and argue that “a false dichotomy exists between quantitative and qualitative approaches” (Onwuegbuzie & Leech, 2005, p. 377). They see both approaches as not inherently related to either worldview and, therefore, believe that “the research question should drive the method(s) used” (Onwuegbuzie & Leech, 2005, p. 377). As I argue in the remainder of this section, a pragmatic approach is a necessity for this project.

² As a precise distinction between the different ‘flavours’ of epistemologies would go beyond the scope of this thesis, post-positivism and positivism are regarded as being in the same historical stream, and are used interchangeably within this text. The same applies to constructivism, interpretivism, and transformative approaches, which are considered as being opposed to (post-)positivism.

4.1.2 Organised Complexity, the Computational Turn, and Big Data

The conflict described above can, in part, be explained by an inability to handle problems of organised complexity, due to a lack of methods and methodology: for example, on the quantitative side, to appropriately consider and measure local specificities; and, on the qualitative side, to obtain an overview of the social systems under consideration. The computational turn in the humanities, and the advent of big data, has made this more obvious; it has also made the need for a pragmatic approach to the social sciences more pressing.

As described in the literature review in sec. 2.5.1, Weaver (1948, pp. 539–540) defines problems of organised complexity as “problems which involve dealing simultaneously with a sizable number of factors which are interrelated into an organic whole” that “cannot be handled with the statistical techniques so effective in describing average behavior in problems of disorganized complexity”. Problems of ‘disorganised complexity’ include, for example, the temperature of a gas or the calculation of insurance premiums for common events in large populations. Even though Weaver was mainly referring to the natural and life sciences, this description of organised complexity also applies to many, often qualitatively approached, research questions in the social sciences, and to all the complications that the social context brings with it.

Luhmann (1997, p. 12) asks whether it is possible to show that heterogeneous functional fields of society – such as science, law, economy, politics, mass media, and (even) intimate relationships – exhibit similar structures, which would enable us to build a theory of society as a whole. Referring to organised complexity, he argues that, when describing society, if the description of the system is part of the system and there can be a multiplicity of descriptions, the system becomes ‘hyperkomplex’. Therefore, the problem of lacking a methodology for understanding these systems is even more pressing (Luhmann, 1997, p. 22); and, with this pressure, the call for a ‘new math’ by Weaver (1948) gains relevance for the social sciences.

In 1997, Luhmann saw the search for this methodology as having failed for almost half a century, and suggested that the way forward was to build theory without waiting for it (Luhmann, 1997, p. 23). He also came to the conclusion that the contraposition of quantitative and qualitative methods that dominates the sociological methodology discourse rather distracts from the actual problems. He argues, from a constructivist position, that methods are not meant to correctly describe reality. They should rather be seen as a refined means of an intra-systematic generation and processing of information. Methods generate reality by themselves, by creating information that has not previously existed. Their results might, in the end, be only remotely connected to the research object or subject at hand. Using a particular method might actually change the system, because it changes the perceived reality. This is one of the reasons why methods allow scientific research to surprise itself (Luhmann, 1997, p. 37).

Luhmann argues that if it were possible, with these postulated methods, to find similar recurring patterns in heterogeneous social fields – such as economics, politics, sports, and culture – this would not be coincidental. Thus, they might lead to the building blocks of a social theory, as they would have to be based in the form of the society’s system (Luhmann, 1997, p. 43). Below, I further discuss how network science is a means to reveal such recurring patterns, and how patterns themselves, as a concept, are able to integrate seemingly opposed methodologies.

One such method is the experimental setup. Habermas (2006, pp. 413–414), in investigating the truth-finding qualities of political deliberation, refers to “an impressive body of small-group studies that construe political communication as a mechanism for the enhancement of cooperative learning and collective problem solving”, involving experimental setups of interacting people. “However, small-scale samples can only lend limited support to the empirical content of a deliberative paradigm designed for legitimation processes in large-scale or national societies” (Habermas, 2006, p. 414).

This is where large-scale digital trace data and the so-called ‘digital humanities’ become relevant. Instead of small-scale experiments, with the help of network science, it now
seems possible to observe, in situ, the structures and patterns that might lead us towards an evidence-based social theory.

In the beginning, the ‘digital humanities’ “were often seen as a technical support to the work of the ‘real’ humanities scholars”. Over time, however, “computational technology has become the very condition of possibility required in order to think about many of the questions raised in the humanities today” (Berry, 2011, p. 2). Now, arts, humanities, and social sciences “use technologies to shift the critical ground of their concepts and theories – something that can be termed a computational turn” (Berry, 2011, p. 11). This computational turn affects more than simply the concepts and theories within these disciplines: computational methods “also allow the modularisation and recombination of disciplines within the university itself” (Berry, 2011, p. 13). This ‘modularisation and recombination’ challenges the academic discourse of the social sciences and brings them into contact, and competition, with disciplines and professional fields that have never considered, or have ceased to consider, methodology and epistemology as being particularly useful.

Furthermore, beyond academia, “at all levels of society, people will increasingly have to turn data and information into usable computational forms in order to understand it at all” (Berry, 2011, p. 15). However, if we distinguish between facts, data, and information, it is information that reduces uncertainty regarding a question (Sechrest & Sidani, 1995, p. 81). Data and facts, on the other hand, “may erupt without any effect on our uncertainty about things that are our focal concerns” (Sechrest & Sidani, 1995, p. 81). This is often the case with so-called ‘big data’, especially when it “exceeds our ability to effectively collect, manage, transmit and analyze it”. This ‘failure’ to handle the data is actually one way to define the term ‘big data’ itself (Halavais, 2015, p. 584).

Despite many attempts to conceptualise it, the definition of the term ‘big data’ remains as vague as the concept itself. One definition that is favoured for this project is that data is ‘big data’ if it has not been sampled down for analysis where, traditionally, for data of comparable magnitude, a purposive sample would have been used. This opens new research possibilities, as “it is difficult to sample when you are seeking out anomalies: the needle in the haystack” (Halavais, 2015, p. 586).
Despite many other attempts to conceptualise it, the definition of the term ‘big data’ remains as vague as the concept itself. One definition that is favoured (for this project) is that data is ‘big data’ if it has not been sampled down for analysis; traditionally, for data of comparable magnitude, a purposive sample would have been used. This opens new research possibilities, as “it is difficult to sample when you are seeking out anomalies: the needle in the haystack” (Halavais, 2015, p. 586). This is an important point: thereby, big data allows data-driven researchers to address the effects of outliers (instead of focussing on what is the norm), and enables us to address

94

CHAPTER 4. METHODOLOGY AND RESEARCH DESIGN

problems of organised complexity. By this definition, network analysis also necessitates big data: more often than not, “the more you know further about every node and the relationships between these, the better measure of the totality can be reached” (Halavais, 2015, p. 586).

Within the computational turn in the humanities, the use of big data is a dividing issue. On the one hand, “there is a suggestion that the social scientist is properly relegated to the dustbin of history, superseded by the data scientist, who, unhindered by theoretical baggage, is able to finally perfect the ideal of ‘social physics’” (Halavais, 2015, p. 584). On the other, there are fears “that we miss opportunities for observation or for theorization …, that an emphasis on big data places the social scientist in the service of the technologies, platforms, institutions, and economic structures that produce, collect, and concentrate this information” (Halavais, 2015, p. 584).

At the extreme of the data-driven argument, big data led some to predict ‘The End of Theory’ (Anderson, 2008). This prediction that “with enough data … answers can be found without questions being necessary … and without questions, theory is superfluous” is not a new one, and is “wrong in a number of ways” (Halavais, 2015, p. 586). The major flaw in this prediction has to do with the difference between facts and information: “As Henri Poincaré famously noted, the scientist is charged with creating order: a collection of facts is no more a science than a heap of stones is a house” (Halavais, 2015, p. 587). The results of most data science represent only “one piece of a process that leads to understanding – to social theory” (Halavais, 2015, p. 587). This ‘piece of a process’ is often a ‘pattern’. I discuss, in sec. 4.1.3, why patterns are not the end result, but a tool for knowledge creation.
Nevertheless, “big social data requires us to think about how the abstract is related to the particular, and recognize that this relationship is complex, tenuous, and difficult to discern. It aims to ground grand social theory in everyday experience” (Halavais, 2015, p. 587). Thus, big data has great potential for the social sciences, as it challenges them with the method(olog)ical dilemma that Luhmann (1997) seems to have given up on. With big social data, there is a glimmer of light on the horizon for answering a central question of social theory: “how society shapes, and is shaped by, individual actions” (Halavais, 2015, p. 587).

The answer to this question, however, is hindered by a lack of interdisciplinary
exchange, and by differing research cultures and worldviews. To unlock the potential of big data to address this central question for social theory, methods such as machine learning, network science, or agent-based modelling come to mind. However, “when those interested in big data from the perspective of computing technologies or algorithms take on questions of social dynamics, social theory appears as a quick citation, if at all, and methodological questions are likewise given short shrift” (Halavais, 2015, p. 589). On the other hand,

when social scientists fail to take on some of the technical challenges of acquiring and manipulating big social data …. the lack of voices from sociologists and others with an understanding and interest in social theory is an abrogation of responsibility. (Halavais, 2015, p. 589)

What is needed from the social sciences is a “shift in perspective and … a set of practical skills” (Halavais, 2015, p. 591), involving a “foundational understanding of programming and networking to work effectively with programmers” (Halavais, 2015, p. 592). This necessary shift in perspective is also an epistemological one. Most quantitative social scientists already have a working knowledge and understanding of ‘traditional’ statistical methods, such as linear or logistic regression, but not of more complex algorithmic approaches (Veltri, 2017, p. 3). The traditional methods, however, have “large shortcomings … as the complexity of datasets increases” (Veltri, 2017, p. 3): often, when using traditional statistical methods, there is “a multiplicity of [contradicting] models that are still a good fit for the dataset”. At the same time, a ‘global’ model “cast on the entire dataset without considering the possibility of ‘local’ models” does not account for the actual complexity of the dataset or, therefore, for the organised complexity present in social interactions.
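The shortcoming of a single ‘global’ model can be illustrated with a deliberately constructed toy dataset (hypothetical numbers, not drawn from any study in this thesis): two ‘local’ groups each show a clearly negative trend, while one regression over the pooled data reports a positive one – a variant of Simpson’s paradox.

```python
# Constructed toy data: two 'local' groups, each with a negative trend,
# whose pooled 'global' regression reverses the sign of the relationship.
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

group_a = ([1, 2, 3], [8, 7, 6])     # within-group slope: -1.0
group_b = ([6, 7, 8], [13, 12, 11])  # within-group slope: -1.0

pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]

print(slope(*group_a))            # -1.0 (local model)
print(slope(*group_b))            # -1.0 (local model)
print(slope(pooled_x, pooled_y))  # ≈ +0.81: the 'global' model reverses the sign
```

A single model fitted to the whole dataset here reports the opposite of what holds in every subgroup; without attention to local structure, the ‘global’ pattern is a statistical artefact.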
This failure leads “to irrelevant theory and questionable statistical conclusions” (Veltri, 2017, p. 3). Network science is no exception to this disciplinary divide. While basic measures and, especially, visualisation methods have been part of social science research for quite a while, a deeper understanding of more complex algorithms, such as those for community detection, is often lacking. This leads to problems with the interpretation of results (as I discuss in chapter 6). At the same time, these algorithms are often developed without social theory and social research questions in mind (see
sec. 2.5.1). This approach leads to the detection of patterns only, and provides no real insights from a social science perspective.

In these cases – in which researchers must at least have a situationalist’s attitude to quantitative methods when using these approaches (see sec. 4.1.1) – differing epistemologies are no longer the main barrier. However, a differing “taste for teleology” – i.e., “a different interpretation of what they mean by answering the question ‘why?’” (Hidalgo, 2016, p. 12) – still leads to incongruences between methods developed by mathematicians or natural and computer scientists on the one hand, and social scientists or media and communication scholars on the other (see also sec. 2.5.1): “Social scientists look for answers to why questions that [involve] the purposeful action of actors …. Natural scientists, on the other hand, answer why questions by looking at the constraints that limit the behavior of the system” (Hidalgo, 2016, p. 12).

For natural scientists, this focus on constraints leads to a heightened sensibility for patterns in the systems under study, which are often used for the building of hypotheses. Constraints usually lead to recurring patterns. Thus, patterns might lead us to deduce the existence of natural laws stemming from the constraints of a dynamic system, and to describe these laws in formal hypotheses that might then be tested in further research. Perhaps due to the lower importance given to the methodology discourse in these fields, however, this strategy is often not formally noted. Also, in the digital humanities, while the concept of patterns is often used, it is rarely clearly defined or given an epistemological foundation. This lack of definition and foundation is the motivation for the next section.
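The point that constraints generate recurring patterns can be sketched with a minimal growth simulation (a generic, hypothetical model, not one of the analyses reported in this thesis): if new nodes attach to existing nodes in proportion to their degree – a preferential-attachment constraint – a heavy-tailed degree pattern with pronounced hubs emerges reliably, regardless of the random details.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=42):
    """Grow a network in which each new node attaches to one existing
    node chosen with probability proportional to its degree."""
    random.seed(seed)
    # 'stubs' lists each node once per unit of degree, so a uniform draw
    # from it implements the degree-proportional choice.
    stubs = [0, 1]        # start with a single edge between nodes 0 and 1
    edges = [(0, 1)]
    for new in range(2, n_nodes):
        target = random.choice(stubs)
        edges.append((new, target))
        stubs += [new, target]
    return edges

edges = preferential_attachment(2000)
degrees = Counter(node for edge in edges for node in edge)
avg = sum(degrees.values()) / len(degrees)
# The maximum degree is far above the average: the recurring 'hub' pattern.
print(f"average degree: {avg:.2f}, maximum degree: {max(degrees.values())}")
```

The constraint (attachment proportional to degree) produces the same qualitative pattern on every run: a few hubs with very high degree alongside a large majority of peripheral nodes, the kind of recurring structure that invites hypothesis building.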

4.1.3 Patterns, Deduction, Induction, and Abduction

The concept of patterns is central to network science. In their extensive, canonical overview of social network analysis methods, Wasserman & Faust (1994, p. 8) refer to patterns to clarify what they mean by structure: “From the view of social network analysis, the social environment can be expressed as patterns or regularities in relationships among interacting units. We will refer to the presence of regular network patterns in relationship as structure.” Across the following 700 pages, the term ‘pattern’ can be found almost 120 times. In a later, complementary publication (Carrington,
Scott, & Wasserman, 2005), the term ‘pattern’ is used 90 times in around 300 pages. However, despite their apparent centrality to the subject, patterns are only implicitly defined in both books, often co-occurring with the term ‘regularities’ (and assuming the reader knows what is meant).

Indeed, patterns are central to human perception, as a means to connect prior knowledge and experience with what is observed in the present. Therefore, they appear as something primitive enough not to need further consideration. Also, especially in the digital humanities, “patterns come up as a shorthand for the shapes and structures that are spotted by human researchers from the information returned by computational processes” (Dixon, 2012). The seemingly primitive nature of patterns is deceiving, however. A discussion of patterns and their role in knowledge discovery is important, as this section (mostly following Dixon, 2012) makes clear. Dixon’s “pragmatic attempt to raise some of the questions that would allow the use of patterns as a justifiable knowledge generation and validation technique” (Dixon, 2012, p. 192) gives a clearer, workable definition of the concept; it also helps us to see different epistemological approaches as part of a continuous spectrum of research practice. This perspective is necessary if we are to put any substantial effort into answering social science questions computationally on a stable epistemological basis.

Patterns “are constructed from quite different visual stimuli”, and “pattern recognition can be seen as matching the incoming visual stimuli to existing mental models” (Dixon, 2012, p. 194). This definition, and the whole discussion that follows, does not apply to visual stimuli only.
However, the definition does point to two possible pitfalls when working with patterns. First, one might spot similar patterns that are not necessarily comparable – rendering similarity meaningless; and, second, “we cannot spot patterns we haven’t encountered before, and have a tendency to seek out patterns or structures that we have seen before, and possibly even to see patterns where there are none – the concept of apophenia” (Dixon, 2012, pp. 194–195).

Nevertheless, if we define patterns as “repeated shapes and structures which are the observable features of an underlying system” (Dixon, 2012, p. 195), they are crucial to the different strands of systems theory and systems-based thinking that have developed since the middle of the 20th century: “The common feature of all these systems-based
approaches to patterns is that repeated, physically observable features are recognised as emergent and convergent principles that reveal underlying forces and processes” (Dixon, 2012, p. 195).³ From a systems-based perspective, the “complex systems of culture, mathematics, or physics are inherently difficult to understand and any attempt to rigorously describe or model them will fail to capture the completeness of the system” (Dixon, 2012, p. 197). As they are recurring, stable, or both, patterns – as “the emergent features [of a system] that are easy to perceive and recognise” (Dixon, 2012, p. 198) – are the result of an evolutionary balancing of the forces of a system. As such, “they are a means of gaining a useful and timely understanding of a complex system” (Dixon, 2012, p. 198); for example, to gain insights into biological ecologies, understand social networks, design better hospitals, create a more satisfactory user experience on a website, or find routes of action in interventionist research. Nevertheless, patterns are limited:

They are not models of the system; they are structural representations of elements within them. They are not metaphors or analogies for emergent properties of a system; they are physically present and not re-interpreted. Neither are they maps, graphs, or diagrams of an entire system; they are descriptions of emergent features of those systems. They are not created; they are documented and described. (Dixon, 2012, pp. 198–199)

On a simplified epistemological continuum – between positivist or empiricist natural science and interventionist action research design, with the humanities in the middle (Dixon, 2012, table 11.1) – the use of patterns as a research tool is fairly unproblematic in design and action research from the interventionist perspective. There, patterns are not intended to be generalisable, but can be seen as the vocabulary of a language, a part of a process; “judged on their practical effectiveness in making actual change” (Dixon, 2012, p. 200); and valid only within the scope of one project, intervention, or design. This perspective can provide us with an epistemological foundation for their use and validation on the rest of the epistemological continuum: “They are useful contextually within the process, and can be validated within the context of the project and against the other methods being used” (Dixon, 2012, p. 201).

³ See also the short discussion of Luhmann’s approach above.

To identify useful – and, with this approach, therefore valid – patterns, and to avoid being misled by apophenia, Dixon (2012, p. 201) proposes Peirce’s pragmatic framework for science and knowledge creation, “that at a wider scale looks surprisingly similar to action research”. Peirce introduced the concept of abduction, as an addition to the accepted practices of deduction and induction, to formalise “the hunches, guesses, and intuition that help the natural sciences” (Dixon, 2012, p. 201).

Abductive reasoning is different in that no logical or empirical connection is required, merely spotting patterns in the data. The results of abduction, however, are not necessarily logically or scientifically coherent; they need to be properly tested, either deductively or inductively, or both. (Dixon, 2012, pp. 201–202)

This framework allows us to argue that positivist and empirical science, and indeed every “good scientific practise looks like it is wrapped in action research” (Dixon, 2012, p. 204). While hypotheses are verified and validated by deductive and inductive reasoning, these forms of doing ‘hard’ science, which are traditionally accepted as scientific practice, are embedded in “rounds of abductive reasoning, intuition, and the practical concerns of doing science” through the use of patterns. Placing the humanities between the two extremes of positivism and interventionist practice, Dixon (2012, p. 205) justifies “the use of patterns, either as the simple recognition of shapes and structures already familiar to us, or through the more systemic approach of emergent, repeated structures” for the digital humanities.

This discussion leaves us with three conclusions:

1. The concept of patterns, their recognition, and their use in building hypotheses enables us to bridge between seemingly opposed positivist (or empiricist) and interventionist epistemologies.

2. Patterns, being a central concept of network science, cannot be the end result of research. Nevertheless, they can be a useful means to this end. Furthermore, if they are useful, they are a valid method for knowledge generation from a pragmatist perspective.

3. Because patterns are such a central concept of network science, apophenia is a serious danger, especially for interpretative approaches with network visualisations, which therefore need further triangulation with inductive or deductive methods.

Together, these conclusions lead to a strong argument that a pragmatist approach is needed if we want to undertake meaningful media and communication studies from a network science perspective, as is explained in the next section.
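The warning about apophenia can be demonstrated with a minimal simulation (an illustrative sketch, unrelated to any dataset analysed in this thesis): a sequence of perfectly fair coin flips almost always contains long runs that look, to a pattern-seeking observer, like meaningful structure.

```python
import random

def longest_run(bits):
    """Length of the longest run of identical consecutive values."""
    best = cur = 1
    for prev, nxt in zip(bits, bits[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

random.seed(7)
flips = [random.randint(0, 1) for _ in range(1000)]
# In 1000 fair flips, the longest run is typically around log2(1000) ≈ 10:
# pure chance produces what looks like a striking 'pattern'.
print(longest_run(flips))
```

Without triangulation, an observer might read such a run as evidence of an underlying mechanism; the same caution applies, at larger scale, to visually salient clusters in network visualisations.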

4.1.4 Pragmatism and Mixed Methods as a Necessity

It is clear that the “shift in perspective” called for by Halavais (2015, p. 591) is not to be achieved by a mere programming upskilling of social science researchers and media and communication scholars. It also needs a shift in the epistemological underpinnings of social science methodology – which, over 20 years after Sechrest & Sidani (1995, p. 77) assessed “that the continuing controversy over quantitative versus qualitative methods hinders advancement of social science (and program evaluation)”, still has an effect, especially on the education and curricula of prospective social science researchers and media and communication scholars.

This project follows Bruns (2011b) in his argument that, if we use big social data generated from internet users’ trace data to better understand society, culture, publics, or media and communication, the necessary research involves two endeavours:

first, a relatively open-ended, exploratory engagement with online objects, developing the ‘natively digital’ methods which are appropriate to their study and examining – not least through experimental trial and error – what useful and reliable data may be gathered about them and their users; this is the ‘follow the medium’ stage of the research. Second, the development of new research questions, and new methods of analysis in pursuit of these research questions, which make use of these available data. (Bruns, 2011b, p. 7)

The first endeavour especially cannot inherently be guided by clearly defined research questions; rather, it involves a playful, exploratory approach – “experimental trial and error” – looking for patterns in a system that Luhmann would call ‘hyperkomplex’. That is, we are exploring computational methods to investigate computationally collected data about computational practices, to gain results about a society that is
affected by those computational practices and, in the end, also by the computational methods themselves. If we want to justify the use of patterns, and the playful, abductive exploration of the data and the tools to inspect it, we cannot do so in a purely (post-)positivist epistemology. At the same time, a purely constructivist or interpretivist approach, and an action research design, does not fit with the intent of this study: to also find ‘quantitative’, generalisable methods for gaining results that, again, need interpretation. This is the case, for example, for community detection algorithms (see sec. 6). This becomes even clearer if we consider that big social (media) data is at the same time both qualitative and quantitative. Most of the time, it is a massive collection of text and metadata, or an abstraction of human behaviour that can only have meaning if interpreted by a reader. However, without computational support, the size of the collection prevents a human reader from extracting more meaning from the data than that collected from a traditional group interview. At the same time, a purely quantitative analysis will usually simply find patterns which might hint at a particular outcome; however, without qualitative interpretation, this cannot be a meaningful or useful result for a social science researcher. This is not a problem if we accept that the quantitative vs qualitative debate in the social sciences could have been justified at a time when many positivists assumed that one could treat society as the rule-based interaction of complex social molecules. Now, however, particularly from the perspective of computational sociology or digital humanities, that debate is actually, and fundamentally, a conflict rooted in historical divisions. It is not based on the actual positions of the worldviews involved, but “stems from misunderstandings and misstatements of the positions involved. 
From a fundamental epistemological standpoint, we are not sure that any differences exist” (Sechrest & Sidani, 1995, p. 78). The goals of both approaches are the same: “both types of inquirers attempt to explain complex relationships that exist in the social science field”, involving “the use of observations to address research questions” (Onwuegbuzie & Leech, 2005, p. 379). Sechrest & Sidani (1995, pp. 78–80) give a convincing account of their view that both approaches are, from a more detached perspective, actually not separable. They summarise their argument in 7 points:


1. Critics from both sides ignore the distinction between what methodologists and epistemologists postulate as normative or determine in hindsight (‘reconstructed logic’), and what researchers actually do while they do research (‘logic in use’).

2. Ultimately, both camps aim for ‘deep understanding’ of the researched phenomena (Sechrest & Sidani, 1995, p. 79). Positivists, too, try to find explanations and interpretations (Onwuegbuzie & Leech, 2005, p. 379).

3. Qualitative researchers do not necessarily get ‘closer’ to the phenomena they research than quantitative researchers do (Sechrest & Sidani, 1995, p. 79).

4. Qualitative results are empirical, as are quantitative results. They differ only “in their preference for numerical precision” (Sechrest & Sidani, 1995, p. 79).

5. The qualitative and quantitative approaches are often conflated with obvious interventions in experiments and noninterventionist (i.e., observational) techniques, respectively (Sechrest & Sidani, 1995, p. 80).

6. The postulated ignorance of quantitative research regarding context is also a misconception: “The critical importance of context is explicit in the often elaborate arrangements that are made in order to limit the context in particular experiments and in frequent recommendations for repeating experiments under different conditions” (Sechrest & Sidani, 1995, p. 80).

7. In the end, “good science is characterized by methodological pluralism, choosing methods to suit the questions and circumstances” (Sechrest & Sidani, 1995, p. 80). Narrowing down the choices for triangulation by restricting oneself to only quantitative or qualitative methods would not make any sense: “Methodological pluralism is an absolutely necessary strategy in the face of overwhelming cognitive limitations and biases inherent in human mental processing and responding” (Sechrest & Sidani, 1995, p. 80).

Thus, as the distinction between quantitative and qualitative is a “false dichotomy” (Onwuegbuzie & Leech, 2005, p. 384), it is worth considering the distinction between ‘clinical’ and ‘formulaic’ approaches proposed by Sechrest & Sidani (1995). While the clinical approach “is personal and cognitive and is not constrained by external formal rules”, the formulaic approach “consists of external formal rules for proceeding”. Both approaches can be used for each of the four stages of empirical research (data collection, analysis, interpretation, and utilisation), depending on the question at hand. For example, data collection can be the automated (formulaic) retrieval of text online by certain criteria (e.g., keywords); the analysis can be an automated topic extraction (still formulaic); the interpretation could be clinical (by close reading of texts related to a certain topic); and the utilisation (again formulaic) could be the sharing of results in a database that contains information regarding these online texts (following a strict code-book). A further utilisation would be clinical, as the close reading might lead a researcher to new hypotheses (perhaps based on abductive reasoning).

What follows is that “clearly, neither tradition is independent of the other, nor can either school encompass the whole research process” (Onwuegbuzie & Leech, 2005, p. 380). At the same time, “mixed method analyses are not always possible or even appropriate” (Onwuegbuzie & Leech, 2005, p. 381). Nevertheless, having the knowledge and the mindset to use and accept a variety of approaches – i.e., to be what Onwuegbuzie & Leech (2005, p. 376) “term as pragmatic researchers” – is necessary to increase the chance of choosing the most appropriate approach. This holds especially true for the present project, which aimed to connect rather positivist, quantitative network science methods with rather qualitative media and communication theory, and did so at a scale that required big social media data which is both quantitative and qualitative at the same time. Therefore, this project had to be conducted from a pragmatic perspective.
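The formulaic stages of the example pipeline just described can be made concrete in a minimal sketch: keyword-based retrieval as data collection, followed by a naive frequency-based extraction of candidate topic terms as analysis. The corpus, keyword, and stopword list below are invented for illustration; a real pipeline would query an API or crawl the web instead.

```python
from collections import Counter
import re

# Toy corpus standing in for automatically retrieved online texts.
corpus = [
    "Community detection reveals clusters in the follow network",
    "The siege dominated Twitter discussion for hours",
    "Detecting communities in the Australian Twittersphere",
]

# Data collection (formulaic): keep only documents matching a keyword rule.
keyword = re.compile(r"communit", re.IGNORECASE)
collected = [doc for doc in corpus if keyword.search(doc)]

# Analysis (formulaic): count non-trivial terms as candidate topic markers.
stopwords = {"the", "in", "a", "for"}
terms = Counter(
    word
    for doc in collected
    for word in re.findall(r"[a-z]+", doc.lower())
    if word not in stopwords
)

print(len(collected))        # 2 documents match the keyword rule
print(terms["communities"])  # 1
```

The clinical stages that follow – close reading of the matched texts, and the abductive leap to new hypotheses – are precisely the parts that resist this kind of formalisation.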

4.2 Research Design

The purpose of this project was to identify, present, and further unlock the potentials of combining media and communication studies, big social media data, and network science methods (see chapter 3). To achieve this goal, it followed a multiphase, mixed methods design. Following the epistemological basis laid out in sec. 4.1, this section fulfils two purposes: it gives an overview and understanding of the overall structure of the research process, and establishes the purpose of this approach. It outlines the overall multiphase design in which the two studies are embedded. Specific methods, however, are discussed within the study reports, alongside the results. These methods were not pre-defined, but emerged from, and together with, the results – as the outcome of an exploratory ‘play’ with data.

The main approach behind the overall research design of this project is in line with Dixon’s (2012, p. 204) insight that “good scientific practise looks like it is wrapped in action research”. It followed an explorative approach, defining and solving problems as they arose, and accepting unexpected questions and their outcomes. This exploration was guided by the goals of approaching fundamental topics of media and communication theory; of asking questions from a media and communication studies perspective; and of seeking to answer these questions with big social media datasets and network science methods.

In sec. 4.2.1, I explain the motivations for choosing a multiphase mixed methods design, and give an overview of the structure of the research design. A brief overview of the data sources used for this project is given in sec. 4.2.2. In sec. 4.2.3, I give an account of a cyclic research framework, developed on the basis of the methodological considerations in sec. 4.1; this framework guides the two studies reported on in the following two chapters, and outlines and reflects on the processes of knowledge creation within them. The framework finds its application in sec. 4.2.4, in which I describe how the cycle is broken down across two studies, consisting of three stages: interpretation of theory; exploration of network science methods; and the final integration of the results and insights with theory. Limitations of this research design are considered in sec. 4.2.5. This is followed by the ethical considerations affecting this project in sec. 4.2.6.

4.2.1 General Approach

Following Creswell (2014, I/1./Mixed Method Designs, para. 8), the design of this research can be classified as a multiphase mixed methods design, in which “concurrent or sequential strategies are used in tandem over time to best understand a long-term program goal”. As such, it is genuinely open-ended, and the research cycle (described in sec. 4.2.3) could well have continued after this project ended. A multiphase mixed methods design is particularly useful if the expected outcome is a “formative and summative evaluation” (Creswell, 2014, table 10.3) – in this case, of the usefulness of network science methods in relation to media and communication studies.

This thesis reports on two studies within this cycle, which consists of an exploratory and an explanatory mixed methods phase. An exploratory sequential mixed methods design starts with a qualitative phase that is followed by a quantitative phase: for example, to develop “better measurement instruments” (Creswell, 2014, table 10.3). An explanatory sequential mixed methods design starts with a quantitative phase that is followed by a qualitative phase – often, to explain “quantitative results with qualitative data” (Creswell, 2014, table 10.3). In general, “theory use in mixed methods studies may include using theory deductively, in quantitative theory testing and validity, or using it inductively as in an emerging qualitative theory or pattern” (Creswell, 2014, “Mixed Methods Theory Use”, para. 1). As explained in sec. 4.2.3, both approaches are used to find and validate the network science methods.

The general approach that guided this thesis was to be transparent and descriptive throughout the research process, rather than presenting the methods as ‘an elusive given’ at the outset. To this end, the abductive, sometimes deductive, process of generating or finding methods for the exploratory phase of the studies was made as explicit as possible.

4.2.2 Data

This project mainly made use of data retrieved from Twitter’s Application Programming Interfaces (APIs), acquired by various means. Twitter was chosen as the research object for several reasons:

• First, and foremost, even though Twitter has now restricted access to its data, as a mostly publicly used platform its data is, compared to that of other platforms, easier to access (see, e.g., Woodford, Prowd, & Bruns, 2017, p. 78).
• At the same time, the public nature of Twitter, and its popularity among journalists, celebrities, and politicians (as a direct channel to their audiences), gives the platform a remarkable prominence in public discourse. (This was recently underlined by the divisive use of Twitter by the current president of the United States of America (USA).)
• Furthermore, before Twitter started to prioritise certain tweets with an algorithm in 2016 (see, e.g., https://www.theverge.com/2016/2/6/10927874/twitter-algorithmic-timeline), it was less of a black box with regard to the spread of items, and had more transparent and easy-to-understand mechanisms than, for example, Facebook. This made it possible to model the mechanics from the outside, without major assumptions.
• Twitter’s mechanisms resemble basic mechanisms of public communication in general. This might also be why other platforms often have very similar, if not identical, mechanisms under different names:
  – listening (follow)
  – answering/commenting (reply)
  – naming (mention)
  – repeating/citing (retweet/quote)
  – acclaiming (like)
• Twitter is tied into an intermedia information flow as a centrepiece between news outlets, blogs, and other produsers. Because of its similarity to other social media platforms, mastering the analysis of Twitter also helps us to understand the use and networked entanglement of other online media.

Due to Twitter’s restrictions on researchers who cannot afford requests to its commercial data vendors, working with datasets that cover the activities of accounts on a national or society-wide scale necessitates a pre-existing infrastructure. Therefore, the Tracking Infrastructure for Social Media Analysis (TrISMA) (Bruns, Burgess, et al., 2016) provided the most important datasets for this project:

• A collection of the details of 3.72 million Twitter accounts, identified as Australian by their description, their timezone setting, and their location fields (a detailed explanation of how this data was gathered, and of its limitations, is given by Bruns et al. (2017); it is also addressed in the second study, in sec. 6);
• The followings of these accounts identified as Australian (i.e., 720 million connections from the 3.72 million accounts in question, of which 167 million were directed at 2.4 million accounts within the dataset); and
• Almost all tweets tweeted by these accounts from 2006 until April 2017.

Additionally, this project made use of datasets collected by members of the Digital Media Research Centre (DMRC) (https://research.qut.edu.au/dmrc/) and the Social Media Research Group (SMRG) (http://socialmedia.qut.edu.au/) at Queensland University of Technology (QUT), and by the Oxford Internet Institute (OII) at the University of Oxford (either with the Digital Methods Initiative Twitter Capture and Analysis Toolset (DMI-TCAT) (https://github.com/digitalmethodsinitiative/dmi-tcat) (Borra & Rieder, 2014), or with custom tools). This data was further supplemented with data gathered via the Twitter APIs. Where a custom or self-programmed tool was used, the approach and the relevant limitations (due to restrictions imposed by the Twitter APIs) are further explained in the respective study report.
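The distinction drawn in the followings dataset – connections whose target lies inside the set of identified Australian accounts versus those pointing outside it – amounts to a simple filtering step, sketched here with invented IDs (the actual TrISMA edge list comprised roughly 720 million (follower, followed) connections, 167 million of them internal):

```python
# Toy sketch: splitting an edge list of followings by whether the followed
# account belongs to the identified set. IDs are illustrative only.

australian_ids = {1, 2, 3}                        # accounts identified as Australian
followings = [(1, 2), (1, 99), (2, 3), (3, 42)]   # (follower, followed) pairs

internal = [edge for edge in followings if edge[1] in australian_ids]
external = [edge for edge in followings if edge[1] not in australian_ids]

print(len(internal), len(external))  # 2 2
```

At the real dataset's scale, the same logic would be run as a streaming pass over the edge list, or as a join inside the database, rather than in memory.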

4.2.3 Research Cycle

The research design described in the next section is based on the methodological considerations in sec. 4.1. Its description here, together with the methodological considerations above, aims – besides its primary purpose of making the research process transparent – to take a further step towards a solid methodological framework for the use of patterns; of mixed and digital methods in general; and of network science in particular, in social science and media and communication studies.

As Dixon (2012) explains (see sec. 4.1.3), good scientific practice is embedded in a cycle of abductive phases, leading, for example, to hypotheses that are verified in deductive and inductive phases. Sechrest & Sidani (1995) argue that the four stages of empirical research (data collection, analysis, interpretation, utilisation) can each be either clinical or formulaic (see sec. 4.1.4). This follows a critical stance against the distinction between quantitative and qualitative, which does not hold up against a fundamentally epistemological perspective. They themselves admit that the term ‘formulaic’ is not necessarily the best choice, and ‘clinical’ might also be misunderstood outside of certain disciplines. For these reasons, I chose the terms rule-based and responsive, respectively.

Creswell (2014) describes, among others, two sequential mixed methods designs: the exploratory and the explanatory mixed methods design. The former transitions from a qualitative (i.e., responsive) to a quantitative (i.e., rule-based) phase (e.g., to find rule-based measures or means to verify insights from a responsive observation); the latter transitions from a rule-based to a responsive phase (e.g., to better understand the outcome of a rule-based experiment).

Figure 4.1: Research cycle guiding each test study

Together, these concepts can be incorporated into a research cycle, as depicted in fig. 4.1. This cycle was the basis of the research design for this project. A study could start, and end, at any point of the cycle. In this project, however, each study started with a responsive phase (on the left of the cycle in fig. 4.1). Based on ‘qualitative’ data, insights, hypotheses, or theory, this phase aimed – partly deductively but, necessarily, also abductively – to find rule-based methods (in this case, preferably network science methods) to verify our responsive observations. This exploratory phase of the cycle provided results that, in a rule-based and inductive manner, led to patterns. This initiated the explanatory phase of the cycle. To interpret these patterns, we had to be even more responsive, and make an abductive jump to hypotheses or interpretations that resulted from a closer look at the underlying data.

If the missing part in this cycle is the method, and if we can close this cycle, we have found refined means for the intra-systematic generation and processing of information – that is, methods as Luhmann (1997) characterises them (see sec. 4.1.2; ‘information’ is used here in the sense of data or facts that reduce uncertainty regarding a question). If these methods ultimately make it possible to lead back to theory and keep the cycle going, they are useful within the scope of the goals of this project and, therefore, valid. From a rather positivistic perspective, the use of empirical data throughout the cycle ties the research itself into perceived reality.

If we encounter contradictions or inconsistencies within this cycle, we can assume that one part of the cycle is flawed (e.g., the theory or the method). These flaws are likely to occur in the abductive jumps in the responsive half of the cycle. From a natural science perspective, one might now argue that this leads to a need to minimise the width and frequency of the abductive gaps, and to take smaller, more cautious steps. However, this strategy would have three implications:

• First, purely rule-based reasoning, when initiated from the wrong starting point, is in danger of never leaving a path that leads to a dead end;
• Second, the hyperkomplex systems we are dealing with in society do not allow for definitions precise enough to argue on a purely rule-based basis; or, if one tried to do so, the argument would be wrong in any context other than the meticulously defined one – a situation that is figuratively comparable to an uncertainty principle;
• Third, social science is not solely about understanding historical occurrences; to reach actionable results in a timely manner, we have to rely on pattern recognition and, therefore, on abduction and heuristics (see sec. 4.1.3).

At some point, a step out of the cycle is needed. To understand frictions along the path of the cycle, it is necessary to stop, take a step back, and reflect on the system created within the cycle; on the systems it is part of; and on the systems it creates information about. This is, again, a mainly abductive and interpretative process, leading to insights about the choices made in the research process. These insights can then be tested in the following iterations of the research cycle.


4.2.4 Research Design

Figure 4.2: Overview of the research design

Applied to the project at hand, the research cycle was repeated twice; that is, in two studies to explore, find, and develop network science methods for large-scale online media studies. As depicted in fig. 4.2, this was followed by a step out of the cycle, to reflect on the underlying difficulties and the usefulness of the methods found, and to evaluate the potential of network science methods for theory.

The consecutive cases were chosen in a way that allowed a step-wise increase of the data volume, while addressing the focus on dynamics (especially the diffusion of [dis]information) and structures (particularly, communities and publics) in a networked public sphere. In detail, the topics to be investigated were:

• The diffusion of news, opinion, and information (see sec. 2.4.2.1), including concepts such as gatekeeping (sec. 2.4.2.3); one-, two-, and multi-step flows; and opinion leaders (sec. 2.4.2.4). These investigations were coupled with a network science perspective on virality and contagion (sec. 2.5.4). This was the subject of a first, smaller-scale study (chapter 5), which dealt with responses on Twitter to the so-called ‘Sydney Siege’, a terrorist attack in Sydney shortly before Christmas 2014.
• The complex of concepts – such as audiences, publics, issue publics, public sphericules, communities, filter bubbles, and echo chambers (sec. 2.4.2.2) – on a national scale, paired with algorithms for community detection (sec. 2.5.3) in a second study (chapter 6). This was addressed by investigating seemingly topically motivated clusters in the follower network of Australian Twitter accounts (see sec. 4.2.2).

In both cases, the research design involved three stages that move around the research cycle described above (in sec. 4.2.3):

1. The interpretation of theory;
2. The exploration of tools; and
3. The use of the results to reflect on theory.

After two repetitions of this cycle in the studies described above, a step out of the cycle led to an evaluation stage.

4.2.4.1 Stage 1: Interpretation of Theory

The approach of starting from a theoretical topic stems from research question 1 (see chapter 3), regarding the interpretation of communication and media studies concepts and theories from a network science perspective, and corresponds with the responsive start of the research cycle described above (upper left in fig. 4.1). This approach was crucial, as it ensured – following the principle of guided exploration – that I was not exploring methods unsuitable for connecting back to theory. Following this principle involved the extensive literature review in chapter 2, and the conduct of the studies in collaboration with media and communication studies scholars.

4.2.4.2 Stage 2: Exploration of Network Science Methods

Following the cycle further along the exploratory path to finding or developing methods, I addressed the second guiding question of this project, regarding the application of network science methods. Building on the interpretation of theory from stage 1, it was necessary to find ways to operationalise hypotheses so that they suited the network science methods relevant to the case at hand. After addressing this theoretical task, it was necessary to find the tools to apply the methods.

Challenges

The choice of methods and tools faced three challenges (specific to this project) that determined the criteria for evaluating them. These were:

• A scalability challenge (because the relevant datasets were large);
• A sustainability challenge (because the findings have to survive in an academic context); and
• A usability challenge (as this research needed to use advanced, sometimes only recently developed, technologies).

Scalability

The goal was to examine datasets on the scale of a society or nation. Such datasets are beyond the scope of desktop applications. Thus, every tool, algorithm, programming language, or method I used had to be scalable to datasets of this size. In order to enable an explorative approach to analysing the data and its possibilities, it had to be possible to analyse a dataset of interest in an iterative way during this project.

Sustainability

For the developed methods to survive in an academic context, they needed to remain

• reproducible,
• transparent,
• adaptable, and
• cost-efficient.

This necessity led to three guidelines:

• If there are alternatives, open source software is always the preferred option.
• Tools used should have an active and thriving community, so that we can expect them to be maintained, supported, and developed for as long as possible.
• Tools should be able to interact with a plenitude of other tools and (especially) platforms.

Usability

The tools used needed to be as easy to learn as possible. At the same time, the resources to enable this learning had to be easily accessible and affordable. As this project’s goal was to explore promising candidates rather than ready-made tools and methods, this challenge could seem less relevant. Nevertheless, it led to a principle that I followed in choosing the tools: they needed to be well documented; they (preferably) had available, well-developed learning resources; and they were easily discoverable.

Tools Used

The technologies used can be classified into two categories:

1. Technologies for data gathering, processing, storage, filtering, and access, as follows:

• DMI-TCAT (https://github.com/digitalmethodsinitiative/dmi-tcat) (Borra & Rieder, 2014) is a Twitter data gatherer developed by the Digital Methods Initiative that “allows one to retrieve and collect tweets from Twitter and to analyze them in various ways”.
• Python (https://www.python.org/) is a programming language with capabilities ranging from the writing of simple scripts to the programming of smartphone apps, or large-scale services such as Dropbox. It features an extensive ecosystem for data science (e.g., the modules Pandas, NumPy, SciPy, scikit-learn, and several modules for data visualisation, mostly based on matplotlib), and for most platform APIs, language wrappers for Python are available. According to a survey in July 2014, Python surpassed Java as a first language for teaching programming in 27 out of 39 top US computer science departments (https://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-u-s-universities/fulltext). Due to its easy-to-learn, flexible syntax, it also seems to have become popular in other disciplines as a “glue language” between different technologies.
• Google BigQuery (https://cloud.google.com/products/bigquery/) is a database Infrastructure as a Service (IaaS) that offers a REST API as well as a web interface; these features make possible the quick analysis of massively large datasets. This made it the database of choice for TrISMA.
• Neo4j (https://neo4j.com/) is an open source database with a storage design and data model optimised for networked data, or graphs. The main benefits of using a graph database, instead of a (traditional) relational database that stores its data in table-like formats, are performance increases by orders of magnitude in a large set of graph-specific problem settings (Robinson, Webber, & Eifrem, 2013, p. 8), and a more natural, semantic way to model networked data.
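The appeal of the graph data model can be illustrated in miniature (this is a stdlib sketch with invented data, not Neo4j's query engine): with an adjacency structure, a question such as 'which accounts are two FOLLOWS hops away?' is a direct neighbourhood walk, where a relational table of (follower, followed) rows would require a self-join.

```python
# Toy follow graph as an adjacency structure: account -> accounts it follows.
follows = {"a": {"b", "c"}, "b": {"d"}, "c": {"d", "e"}}

def two_hops(start):
    """Accounts reachable in exactly two follow steps, excluding the start."""
    second = set()
    for middle in follows.get(start, set()):
        second |= follows.get(middle, set())
    return second - {start}

print(sorted(two_hops("a")))  # ['d', 'e']
```

Longer traversals (friends of friends of friends, and so on) extend this walk step by step, which is exactly the access pattern that graph databases optimise for.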

2. Technologies for network analysis:

• Gephi (https://gephi.org/) (Bastian et al., 2009) is a popular network analysis application with a strong focus on a graphical user interface and interactive network visualisation.
• graph-tool (https://graph-tool.skewed.de/) (Peixoto, 2017b) is a Python module that makes use of the performance achievable with the lower-level programming language C++ for high-performance network generation, manipulation, analysis, and visualisation.
• NetworKit (https://networkit.iti.kit.edu/) (Staudt, Sazonovs, & Meyerhenke, 2016) is a Python module that takes a similar approach to graph-tool, wrapping high-performance C++ code in a Python interface. However, it is also conceptualised as a test bed and platform for the development of novel network analysis algorithms. It does not provide network layout and visualisation capabilities.
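Whichever of these toolkits is chosen, a typical first exploratory measurement is elementary enough to need no specialised library at all: for instance, the in-degree (follower-count) distribution of a directed edge list (a stdlib sketch with invented edges):

```python
from collections import Counter

# Toy directed edge list: (follower, followed) pairs.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c"), ("d", "c")]

in_degree = Counter(target for _, target in edges)   # node -> follower count
distribution = Counter(in_degree.values())           # in-degree -> node count

print(in_degree["b"], in_degree["c"])  # 3 2
```

Libraries such as graph-tool and NetworKit matter once the edge list grows to hundreds of millions of entries, or once the analysis moves beyond such counts to, for example, community detection.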

4.2.4.3 Stage 3: Reflection with Theory

The last stage of each study consists of the explanatory half of the research cycle. With the results of the rather rule-based network science methods in hand, what is left is their interpretation. Rule-based network science methods mostly result in patterns, or in the verification of patterns found by more responsive methods of investigation, such as the interpretation of an (algorithmic, and therefore rule-based) network visualisation. If this is the case, following the methodological framework described in sec. 4.1, abduction can lead us to new hypotheses, or the evidence found can lead (via interpretation) to an adjustment of existing theory. As this process is again prone to errors – due to researcher bias or apophenia, for example – the cycle can start again in order to find methods to triangulate the findings.

4.2.4.4 Stage 4: Evaluation

While the three stages above lead to an outcome in methods, gaining methodological insights and evaluating these methods requires taking a step back. This leads to a stage of reflection outside of the cycle. While this stage formally takes place after having gone through the research cycle a few times, in practice it happens in parallel with the research process. This parallel evaluation also led, in part, to the methodological considerations in sec. 4.1, and to the abstraction of the research design in this section. The results of this reflection – regarding the underlying difficulties in using network science in media and communication studies, the utility of these methods, and their potential for theory building – are presented in the discussion in chapter 7.

4.2.5 Limitations

4.2.5.1 Design

It is important to note that the introduction of rule-based methods does not, by itself, make the research process any more rigorous, or bring it closer to any objective reality. The reason for this is (at least) twofold: first, if the theory or the data is flawed, even sound methods can only lead to flawed hypotheses – ex falso quodlibet; and, second, social theory is not simply right or wrong, but depends on interpretation. Therefore, a method can only be sound and useful within the scope of the question and theory at hand. From a constructivist perspective, it can be interpreted as a sophisticated means to process and generate information. From a post-positivist perspective, if method or theory is flawed, the use of ‘real-world’ data will ultimately lead to contradictions after a sufficient number of cycles. From a pragmatic perspective, this usefulness is enough to validate the method.

The methods explored with our approach need, however, to be tested further, to verify their function and triangulate their results. This can be done by applying them to more cases for which some kind of ground truth exists; this was not possible, however, within the scope of this project. Furthermore, the last stage of the project (i.e., reflecting on the usefulness of the methods) is necessarily subject to confirmation bias. I have tried to address this by openly describing the subjective decisions in the studies in as much detail as possible, and by taking a self-critical stance in the concluding discussions.

Part of this self-critical stance is to admit at the outset that, while purposeful for this project, the decision to work with Twitter data, and the self-restriction to network science methods, narrowed the perspective to that of a network paradigm. While it is possible to model many communication phenomena as networks, not all of them are necessarily well, or comprehensively, represented by such models. A network paradigm can lead to a functionalist perspective on communication and the public sphere; it is in danger of neglecting not only deeper causes of observed behaviours, but also the context in which the phenomena take place. With this limitation of the chosen object of study and methods in mind, however, as explained above, the use of big social data and the application of network science methods show great potential to add to the knowledge present in media and communication studies – especially given its functionalist approach.

4.2.5.2 Data

First, as already pointed out in sec. 2.2.1, Twitter’s publicly available APIs exhibit limitations in the completeness of their data. Driscoll & Walker (2014) point out that “the Streaming API excels at longitudinal data collection, but is a poor choice for massive, short-term events”, and the costs of accessing the complete datasets are often prohibitively high. Tromble & Storz (2017; summary available at http://snurb.info/node/2279) tested the Search API which, according to Twitter, aims to return the most ‘relevant’ tweets. They found that if a keyword is contained in a high volume of tweets, this API omits a significant portion of tweets compared to the complete dataset. I explain the limitations of these APIs in more detail in the relevant sections of the study reports in the following chapters. However, it must be noted that this is actually just one example of a bigger problem for the stream of research for which this thesis is arguing:

Whereas the reliability and validity of established social scientific methods depend on their transparency, big social data are almost universally produced within closed, commercial organizations. In other words, the stewardship of this unprecedented record of public discourse depends on an infrastructure that is both privately owned and operationally opaque. (Driscoll & Walker, 2014)

Even though this limitation definitely affects the validity of the results, the data are arguably valid for our goal in method and methodology: despite not being complete, they resemble a complete, large-scale dataset closely enough to develop methods and gain methodological insights. These methods and insights can then be applied; and, after their use has been justified by projects such as this one, they will be helpful to researchers who are actually able to work with complete data. This possibility could become more likely with the recent introduction of Twitter’s premium APIs (https://blog.twitter.com/developer/en_us/topics/tools/2017/introducing-twitter-premiumapis.html). While these are still in beta, they are meant to close the affordability and accessibility gap between the free, incomplete public APIs and the complete enterprise APIs offered by Twitter’s data vendor Gnip.

The second limitation is an obvious one: Twitter does not represent society. Focussing on Twitter data limits what can be discovered about broader human communication. Affordances of that specific platform can also lead to artefacts that cannot be ascribed to general rules of social or human behaviour. However, as already pointed out in sec. 4.2.2, I still see the benefit of working with Twitter data as a ‘toy model’ of human communication, to gain experience with data about human communication on this big social data scale.

4.2.6 Ethical Considerations

The Office of Research & Integrity at QUT has approved this project under the generic clearance number 1200000491. This clearance is in place for any negligible risk research that uses only extant collections of data and publicly available Twitter data. It requires that this research is of negligible risk (i.e., there is no foreseeable risk of harm or discomfort, and the only foreseeable risk is no more than inconvenience).

The data collected and analysed can contain a mixture of individually identifiable, potentially re-identifiable, and non-identifiable data. While the nature of the data is easily identified and assured, the assessment of the risks involved requires more consideration. To guide this assessment, in line with the process-based approach of this project, I followed the recommendation of the Association of Internet Researchers (AoIR) Ethics Working Committee: “a process based approach to ethics, which emphasizes the importance of addressing and resolving ethical issues as they arise in each stage of the project” (Markham & Buchanan, 2012, p. 12). Therefore, “rather than prescribing a set of approved practices”, they “suggest a characteristic range of questions that should be asked by internet researchers as well as those responsible for oversight of such research” (Markham & Buchanan, 2012, p. 12). This section addresses some of these questions as they are relevant to this project. This consideration is especially important given that the goal of this project is to help establish a framework for future media and communication scholars undertaking this kind of research.

Even though this project deals with publicly available data, its ethical implications mostly circle around issues of privacy and data protection. This is due to the fact that “data aggregators or search tools make information accessible to a wider public than what might have been originally intended” (Markham & Buchanan, 2012, p. 6).
Especially when dealing with large datasets, “it is possible to forget that there was ever a person somewhere in the process that could be directly or indirectly impacted by the research” (Markham & Buchanan, 2012, p. 6). This makes it necessary to assess possible harm and benefits, and the balance between them, for the persons in question. In dealing with big social data, the informed consent of participants can be considered an impossible goal. One could argue, however, that by signing up to Twitter and making their tweets public, users have consented to the terms of service, including Twitter’s privacy policy, which prominently states:

“What you share on Twitter may be viewed all around the world instantly. You are what you Tweet!” (https://twitter.com/en/privacy, retrieved on 2.12.2017)

And in further detail:

Twitter broadly and instantly disseminates your public information to a wide range of users, customers, and services, including search engines, developers, and publishers that integrate Twitter content into their services, and organizations such as universities, public health agencies, and market research firms that analyze the information for trends and insights. When you share information or content like photos, videos, and links via the Services, you should think carefully about what you are making public. We may use this information to make inferences, like what topics you may be interested in. (ibid.)

There are two catches to this argument. Although signing up to the terms of service might be absolutely valid from a legal perspective, from an ethical and academic standards perspective it does not represent informed consent. Terms of service are seldom read; Twitter does not have strong safeguards against under-aged persons signing up; and (even) most adults might not understand what can actually be inferred from the data they share. It is possible to argue that people at least understand the visibility of their tweets, as they are shown on their profile pages. Nevertheless, there still remains a sense of ‘perceived privacy’, and Twitter users might have expectations of an “appropriate flow of information” (Markham & Buchanan, 2012, p. 9). This holds especially true if users have a low number of followers and, therefore, a false sense of safety in obscurity due to the sheer number of other Twitter accounts. However, for very prominent Twitter accounts, or accounts that are obviously striving to reach a larger audience, the opposite holds true. Therefore, a case-by-case evaluation is necessary when published research surfaces content, accounts, or groups of accounts from the obscurity of Twitter as a whole.

To ensure that this remains a case-by-case decision, the data as a whole have to be protected. In the case of Twitter, this protection is partially ensured by rate limits.

The public APIs allow more convenient automated access to the data than the website, but do not allow the indiscriminate access that accepted Gnip customers receive. In the case of TrISMA, its data access and use guidelines (https://trisma.org/trisma-data-access-and-use-guidelines/, retrieved 2.12.2017) prohibit the sharing of the data outside of accredited projects and teams, and advise against the publication of identifiable material. Furthermore, ethics approval by the participating institutions for the respective project is necessary to gain accreditation. In the case of the data collected, analysed, and stored specifically for this project, standard data safety procedures (e.g., password protection and encryption of data) were in place.

When it comes to the presentation of findings, verbatim material imposes the highest risk of being linked back to single accounts, despite anonymisation. Even if the immediate risk of causing harm in this way is negligible, future risk might be significant: “For example, while a participant might not think his or her information is sensitive now, this might change in five years” (Markham & Buchanan, 2012, p. 10). As the main goals of this project are the development of methodology and methods, I do not see the necessity to quote any material verbatim if it has not had public prominence at some point already (e.g., through thousands of retweets or mentions in the press). Nevertheless, I see an ethical obligation to also reassess the re-publication of this kind of content; especially if tweets have been deleted, they should not be made public again as part of a research report.

Not only what an account has posted, but also what can be inferred from this post and other activity records (i.e., metadata) can be harmful “to life, to career, to reputation” (Markham & Buchanan, 2012, p. 10) of its user. As an example, in chapter 6, we identify core accounts in dense clusters in the Australian follower network that exhibit a distinctive use of certain keywords in their tweets. Even if an account has never used any of these keywords, it might end up in one of these core groups due to its followings; it can, therefore, be associated with, for example, hard-right politics, leftist-progressive activists, or the LGBTIQ community. Any of these scenarios could put the individual at risk for different reasons. This is especially the case because these associations are the result of data processing and analysis; they are constructed, and could be wrongly inferred. It is, therefore, important to protect individual accounts while still presenting the overall picture of the findings and the way in which they were achieved.

This last point, however, leads to a new dilemma. If we present the methods with which such results can be achieved, unethical abuse of these methods (with their associated risks, as identified above) cannot be prevented. This leads us to the most complex and most subjective group of questions identified by Markham & Buchanan (2012, p. 11): the benefits for the research participants; the potential risks for participants; and whether “the research is aiming at a good or desirable goal”. There is no way around admitting that the participants will not directly benefit from this study. The risks have been described above, and I believe that it is possible to (at least) keep them negligible by considering them throughout the research process. However, the indirect benefit for the participants – as social media users, citizens, and humans – and the overall desirable goal of this project can be, and has been, argued for; and this argument continues in what follows. Commercially-funded research is happening behind closed doors and, due to better data access, is probably advancing at a significantly faster pace than academic research in this field. At least, however, academic research remains open to public scrutiny.

Chapter 5

Study 1: Measuring Communication Cascades

(This chapter is partly informed by, and draws upon, research that will be published in a forthcoming book chapter (Mitchell & Münch, 2018). However, the chapter at hand takes a perspective more focused on methods and methodology.)

A terrorist attack in Sydney and the decision of Great Britain to leave the European Union – both were causes for a number of ‘viral’ news dissemination and political communication events online. But what distinguishes a ‘viral’ from a ‘not-viral’ or ‘not-so-viral’ hashtag? What does ‘virality’ mean, and how does one cascade of ‘infections’ with one piece of information or behaviour differ from another? How can we observe, measure, and express these differences?

This study constitutes the first iteration of the cycle described in the research design for this project in chapter 4.2. Starting from objectives (sec. 5.1) rooted in the interpretation of established theories about media and communication, and an exploration of corresponding approaches from network science (sec. 5.2), I analyse three large datasets related to the spread of a piece of information or a behaviour – namely, the spread of two hashtags and a link – on Twitter (sec. 5.3). The results are presented alongside the methods and their evaluation in sec. 5.4. Following this, a discussion in sec. 5.5 interprets these results; shows the relevance of these results and methods for media and communication theory regarding a networked public sphere; and leads to the consideration of further possible investigations from a network science perspective in sec. 5.6. The conclusion in sec. 5.7 gives a short summary of the study and insights

gained through it, thus providing a basis for the overarching discussion and evaluation in chapter 7.

5.1 Objectives

As explained earlier, this study addresses the concept of diffusion within communication networks, and touches on hypotheses regarding gatekeepers, opinion leaders, and one-, two-, or multi-step flows (all discussed in the literature review in sec. 2.4.2.1). It does so via two investigations: one that investigates Twitter data around two hashtags, #sydneysiege and #illridewithyou, which emerged after a terrorist attack in Sydney shortly before Christmas 2014; and one that investigates tweets linking to a petition to repeat the Brexit referendum (from the end of May to the beginning of August 2016).

Primarily, making use of network science methods and thinking, the overall study aims to connect concepts around the diffusion of news, information, or behaviour with notions of virality and contagion. The datasets (described in more detail in sec. 5.3) were chosen because all three show qualitative differences, especially regarding the nature of the items spread, the expected influence of diffusion channels outside of Twitter, and their perceived ‘virality’. At the same time, this study addresses the vagueness of the term ‘virality’ – a term oscillating between concepts such as disease-like spread; exponential and logistic growth; its confusion with popularity; and its abuse as a marketing buzzword. This is achieved by exploring, developing, and presenting rule-based quantitative methods to grasp, and more precisely define, different notions of ‘contagion’ and ‘virality’. In doing so, it builds upon the research lines discussed in the following sec. 5.2 – especially regarding the notion of structural virality, the concepts of simple and complex contagion, and outside influences on information diffusion in social media – and contributes comparative evidence and new methods that lead to theoretical implications for these.
By taking a close-up approach in terms of the number of cases, data volume, and depth of analysis, it complements larger-scale studies of these topics with a deeper understanding of the limitations of the employed methods when applied to single cases. At the same time, it provides means for smaller-scale studies to back their findings with empirical, measurable, and comparable results.

5.2 Background
This section serves, on the one hand, as a reminder of the relevant sections in the literature review in chapter 2. On the other, it carves out some of the details relevant to the understanding of this chapter, and condenses and synthesises the respective concepts to show how they interact. We follow the structure already chosen in the literature review: a rough categorisation into theories from media and communication studies on the one hand, and the network science perspective on the other.

On the media and communication studies side of the divide, this study is obviously in line with the traditional news diffusion studies outlined in sec. 2.4.2.1. It deals with a fundamental question of communication studies: “Who – Says What – In Which Channel – To Whom – With What Effect?” (Lasswell, 1948). In a broader sense, it addresses the concept of gatekeeping (sec. 2.4.2.3), and has the potential to contribute to the discourse around one-, two-, or multi-step flow hypotheses (sec. 2.4.2.4). As it deals with loosely connected flocks of people that crystallise around hashtags and links as their condensation nuclei, it also draws on the background of recently rethought concepts of communities and ad-hoc issue publics (see sec. 2.4.2.2).

As is shown later in sec. 5.5, notions of virality and contagion inherently contain the abovementioned concepts. Most network science researchers do not seem to be aware of the origins of contagion as a metaphor for social interaction, which can be found in social contagion theory and emerged in the second half of the 19th century. Tarde (1903), for example, saw ‘contagious imitations’ between individuals and their initiation as an ‘elementary social act’ that has ‘accomplished everything socially’. According to Latour (2012, p. 117), Tarde is, in fact, a long forgotten forefather of Actor Network Theory (ANT) – so advanced in his thinking that Latour portrays “actor-network as a precursor of Tarde”, despite history having seen a reverse sequence.

Nevertheless, the metaphor of the quasi-biological, organic spread of information or behaviour has stuck; however, it remains too vague, too often. This is due to the fact that, even on the more rule-based network science side of the field, the terms ‘viral growth’, ‘virality’, and ‘contagion’ actually describe different concepts. Besides other possible uses of the term, a look into the literature and social media metrics reveals that:

• Some items of information can be inherently viral;
• The process of their spread can be viral (i.e., constituting contagion in differing forms); and
• Patterns found in collected data in hindsight can show some sort of ‘virality’ as a structural property.

The latter two features have already been addressed in the literature review in chapter 2. An example of the first is quite straightforward, and constitutes a common definition of the ‘virality’ of an item in social media analytics: it may be measured, for instance, as the number of shares, forwards, or retweets per view; and it is strongly correlated with the growth rate of the audience in models of viral contagion. As a prominent example, Facebook used to offer this definition as one of the metrics available to administrators of its pages (until 2012). If we consider contagion as a process, the concepts of simple and complex contagion, which were introduced in the literature review in sec. 2.5.4.1, come into play. Considering the structure of contagion cascades, we can make use of the Wiener index, or structural virality (see sec. 2.5.4.2): the longer the average shortest path length between the nodes in a diffusion cascade, the higher the structural virality; the shorter, the more the cascade can be considered a ‘broadcast’ cascade (see fig. 5.1).

Figure 5.1: Depiction of a pure broadcast and structurally more viral cascade. While both have the same number of nodes, the average shortest path length is longer on the right.
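Structural virality, as defined here, can be computed directly as the mean pairwise distance in a cascade tree. The following stdlib sketch (toy cascades, not the thesis code) reproduces the contrast depicted in fig. 5.1:

```python
# Sketch (toy data, not the thesis code): structural virality as the
# average shortest path length between all pairs of nodes in a cascade
# tree, following the definition discussed above.
from collections import defaultdict, deque

def structural_virality(edges):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total, pairs = 0, 0
    for source in adj:                      # BFS from every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs                    # mean over ordered pairs

broadcast = [(0, i) for i in range(1, 10)]   # one seed, nine direct retweets
chain = [(i, i + 1) for i in range(9)]       # passed along person to person

assert structural_virality(chain) > structural_virality(broadcast)
```

For ten nodes, the pure broadcast scores 1.8 while the chain scores about 3.67, matching the intuition that longer average paths indicate a structurally more ‘viral’ cascade.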

This measure of virality is practical, as it enabled Goel et al. (2015) to take a bird’s eye view of a large set of diffusion cascades on Twitter, and of ways to model them. However, like the definition of the inherent ‘virality’ of an item, it ignores the process taking place during an item’s spread. This process differs from the spread of a disease if, for example, a socially risky behaviour exhibits a complex contagion dynamic. ‘Structural virality’, as defined by Goel et al. (2015), is actually undefined if an item has more than one source. This study addresses this problem by testing two ways to circumvent it. At the same time, it confirms expectations regarding the relative structural virality of the analysed cases.

Also from a bird’s eye view, Romero et al. (2011) investigated the complexity of contagion for a large number of diffusion cascades on Twitter (see also sec. 2.5.4.1). They found evidence that, while the contagion probability of a political hashtag improves up to a high number of exposures, this is not the case for ‘idiomatic’ hashtags (i.e., tags “representing a conversational theme on Twitter” (Romero et al., 2011, table 1) such as #cantlivewithout, #dontyouhate, or #musicmonday (Romero et al., 2011, table 2)). However, their number of possible exposures was based on a proxy, namely an @mention network: if one account had mentioned another a certain number of times, they assumed that this account would pay attention to the mentioned account. This study employs both of these concepts but analyses exposure based on follow networks instead of @mention networks, and it does so from a closer perspective on three cases, to gain a better understanding of the actual meaning of these measures. Here, too, expectations based on the findings of Romero et al. (2011) and theories regarding complex contagion are confirmed by this study.
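An exposure-based measurement of contagion complexity can be sketched as follows: for each user, order the adoption times of the accounts they follow, count how many exposures they received before (possibly) adopting, and estimate the probability of adoption at the k-th exposure. This is a simplified sketch with hypothetical data structures, not the exact procedure of Romero et al. (2011) or of this study:

```python
# Sketch (hypothetical data structures, not the thesis code): estimating
# the probability of adopting a hashtag after the k-th exposure, in the
# spirit of Romero et al. (2011), but based on follow relations.
from collections import Counter

def exposure_curve(follows, adoption_time):
    """follows[u]: accounts u follows; adoption_time[u]: first use (or absent)."""
    at_risk = Counter()   # users who reached k exposures before adopting
    adopted = Counter()   # of those, users who adopted right after exposure k
    for user, friends in follows.items():
        t_user = adoption_time.get(user)
        times = sorted(adoption_time[f] for f in friends if f in adoption_time)
        # exposures received before the user's own adoption (or all, if none)
        exposures = [t for t in times if t_user is None or t < t_user]
        for k in range(1, len(exposures) + 1):
            at_risk[k] += 1
        if t_user is not None and exposures:
            adopted[len(exposures)] += 1
    return {k: adopted[k] / at_risk[k] for k in at_risk}

follows = {'a': ['x'], 'b': ['x', 'y'], 'c': ['x', 'y']}
adoption_time = {'x': 1, 'y': 2, 'a': 5, 'b': 3}
curve = exposure_curve(follows, adoption_time)   # {1: 1/3, 2: 0.5}
```

A flat or declining curve would be consistent with simple contagion; a curve that rises over the first few exposures indicates complex contagion.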
In its close-up approach, this study is comparable to the study undertaken by Giglietto & Lee (2017), who looked at discursive strategies around the hashtag #JeNeSuisPasCharlie. This hashtag, and the hashtag #illridewithyou that is featured in the study documented here, are examples of “a recently growing trend … that users — especially when facing controversies, conflicts, and crises — choose a pithy phrase that serves as a “mini statement” in its own right” (Giglietto & Lee, 2017, p. 12). However, while Giglietto & Lee (2017) employed time series analysis, manual coding, network visualisations, and automated forms of content analysis to understand the ‘who?’, ‘what?’, and ‘why?’, this study focusses on network analysis to explore ‘in which channel?’, ‘to whom?’, and ‘with what effect?’.

This study can also be seen as a further contribution “towards a typology of hashtag publics” (Bruns et al., 2016), albeit from a network science perspective. While Bruns et al. (2016) used comparatively easily accessible and understandable aggregate counts – such as the number of retweets and tweets containing a link in a hashtag-based data collection – to, for example, categorise hashtags as ‘media events’ or ‘acute events’, this study adds another dimension by investigating the structure and dynamics of their spread.

5.3 Cases and Data Description

As mentioned earlier, this study analyses and compares three cases of an item spreading on Twitter: two hashtags, #sydneysiege and #illridewithyou; and a link to a petition to repeat the Brexit referendum (https://petition.parliament.uk/archived/petitions/131215). In this section, I provide a short summary of the events leading to the popularity of these items on Twitter, and describe the data gathering process.

5.3.1 #sydneysiege and #illridewithyou

5.3.1.1 Event

From the morning of Monday 15 December 2014, a gunman held 18 customers and employees of the Lindt Chocolate Café in Martin Place, Sydney, captive. News and social media quickly took up the event, one reason being that the café is located opposite the Seven News television studios. From the very beginning, the event was referred to as the ‘Sydney Siege’ by the mainstream media. In the early morning of the next day, a police Tactical Operations Unit ended the siege; the gunman and two hostages were killed during the raid (see, e.g., https://au.news.yahoo.com/nsw/a/25772503/siege-situation-in-martin-place/).

This study focuses on two hashtags that emerged on Twitter during the incident: #sydneysiege, which was mainly used to tweet about the event in general, and #illridewithyou, which came into being as a counter-reaction to Islamophobic sentiments that surfaced during and after the hostage crisis. #illridewithyou was used as a sign of solidarity with Muslims who were afraid of public repercussions, by showing a willingness to protect them on public transport. Notably, this “beautiful movement that will restore your faith in humanity” (https://www.businessinsider.com.au/illridewithyou-goes-viral-as-australians-band-together-against-islamphobia-2014-12) began with two tweets (on the Monday evening) from an account that (at the time of writing) has scarcely more than 3000 followers:

If you reg take the #373 bus b/w Coogee/MartinPl, wear religious attire, & don’t feel safe alone: I’ll ride with you. @ me for schedule (https://twitter.com/user/status/544375598286512130)

and

Maybe start a hashtag? What’s in #illridewithyou? (https://twitter.com/user/status/544363674505199616)

While #sydneysiege has been the more widely used hashtag (see sec. 5.3.1.2), #illridewithyou was the one reported by online news media as having gone ‘viral’ (see, e.g., http://www.smh.com.au/nsw/martin-place-siege-illridewithyou-hashtag-goes-viral-20141215-127rm1; https://www.aljazeera.com/news/asiapacific/2014/12/illridewithyou-goes-viral-after-sydney-siege-2014121512387983113.html). To investigate this perceived qualitative difference regarding ‘virality’, in contrast to the mere number of uses, is one goal of this study.

5.3.1.2 Data Description

As both hashtags found a global audience, the collection of tweets by identified Australian Twitter users provided by TrISMA (see sec. 4.2.2) was not suitable for this study. Furthermore, buying historical data at the necessary scale from Gnip would have been prohibitively expensive for this project. Due to the acute nature of the event, some tweets also had to be collected in hindsight. This led to a hybrid approach, using the Search and the Streaming APIs of Twitter in a complementary way. (This approach was developed, and the data collection executed, by Darryl Woodford and Katie Prowd; see also Woodford et al. (2017) for an instructive overview of the investigation of social media audiences beyond Twitter.)

Both APIs have an endpoint delivering tweets and their metadata containing requested keywords or hashtags (a detailed documentation of all Twitter APIs can be found at https://developer.twitter.com/en/docs). While the Streaming API provides a complete dataset most of the time, it is a rate-limited live feed: using the publicly available APIs, if the number of matching tweets surpasses around 1% of the global number of tweets sent out on Twitter, the output is rate-limited, and beyond this limit the way the sample is drawn is undocumented. Furthermore, the Streaming API does not provide any historical data, but only pushes tweets in real time to a listening data collector. If this collector fails, or is not yet running, the tweets are not retrievable a posteriori via this API.

In contrast, the Search API provides the possibility to retrieve tweets for about a week after they are posted. Its output resembles the results one would get from searching for tweets via the Twitter website. However, as Twitter points out in its documentation, this data might not be complete: it represents a sample of the tweets deemed most relevant by an undocumented algorithm. Nevertheless, if the volume of tweets is not too high, spot-checking of datasets available to the DMRC shows that a majority of 90% or more of the matching tweets is usually retrieved. As pointed out by Tromble & Storz (2017), this value can drop significantly if the number of tweets containing the requested keyword is very high, or even hits the rate limit of the public Streaming API.

The approach followed for this study combines both APIs to make up for the fact that the hashtags #sydneysiege and #illridewithyou emerged spontaneously and did not allow for a pre-planned collection. After #sydneysiege had emerged as a popular hashtag on Twitter, a collection via the Streaming API for this hashtag was started using DMI-TCAT. At the same time, a backfill via the Search API was undertaken for the tweets that had been missed. The first tweet collected via the Streaming API was posted on Monday at 00:58 Coordinated Universal Time (UTC) (i.e., 11:58 in Sydney), while the Search API retrieved the earlier tweets from Sunday 23:05 UTC (10:05 Australian Eastern Daylight Time (AEDT)) onwards.
The start of the collection for the hashtag #illridewithyou was automatically triggered when its prevalence in tweets tagged with #sydneysiege crossed a previously set percentage threshold. This started the same process for this hashtag as for #sydneysiege: a collection of forthcoming tweets containing the hashtag via the Streaming API, and a collection of tweets from the hours beforehand via the Search API. This study deals with the early stages of the spread of both hashtags, when the number of tweets was well below the maxima of registered activity. Therefore, it is highly unlikely that rate limits affected the data collected via the Streaming API.

Accordingly, numbers were even lower for the collection via the Search API, as it collected data even before these stages of comparatively low activity. This, and the fact that no abrupt discontinuities in volume around the known transition points between both APIs could be found, led us to consider that the dataset was reasonably complete.
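Such a hand-over between two APIs can be validated in code: de-duplicate the combined collections by tweet id and inspect per-interval volumes around the known transition point. A minimal sketch with hypothetical record layouts (not the DMI-TCAT implementation):

```python
# Sketch (hypothetical records, not the DMI-TCAT implementation): merge
# Search- and Streaming-API collections and inspect tweet volume per time
# bin for abrupt discontinuities around the API hand-over point.
from collections import Counter

def merge_collections(search_tweets, stream_tweets):
    merged = {t["id"]: t for t in search_tweets}
    merged.update({t["id"]: t for t in stream_tweets})   # de-duplicate by id
    return sorted(merged.values(), key=lambda t: t["timestamp"])

def volume_per_bin(tweets, bin_seconds=60):
    # tweets per minute (Unix timestamps assumed)
    return Counter(t["timestamp"] // bin_seconds for t in tweets)

search = [{"id": 1, "timestamp": 10}, {"id": 2, "timestamp": 50}]
stream = [{"id": 2, "timestamp": 50}, {"id": 3, "timestamp": 70}]
tweets = merge_collections(search, stream)   # three unique tweets
```

A sudden drop or jump in the binned volume exactly at the transition point would suggest that one of the two collections is incomplete.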

Figure 5.2: Running sum of the number of collected tweets containing the hashtag sydneysiege. Reference lines mark minima and maxima of dates and running sum.

The whole dataset of collected tweets containing #sydneysiege comprised around 690 000 tweets, and covered the period between 10:05 AEDT on 15 December 2014 and 14:38 AEDT the following day (see fig. 5.2). About 371 000 tweets containing #illridewithyou were collected; they were tweeted between 16:29 AEDT on 15 December 2014 and 10:32 AEDT the following day (fig. 5.3). Figs. 5.2 and 5.3 show that both datasets exhibit a sigmoid growth curve: the characteristic S-shape of logistic growth, which arises when initially exponential growth approaches an upper limit. The same holds true for the running sums of new accounts entering the conversation, that is, using the respective hashtags for the first time (figs. 5.4, 5.5).
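The logistic curve referred to here is N(t) = K / (1 + e^(−r(t − t₀))): near-exponential growth at first, then saturation towards the upper limit K. A small sketch with assumed, purely illustrative parameters:

```python
# Sketch with assumed, purely illustrative parameters: a logistic growth
# curve of the kind whose S-shape the running sums discussed above resemble.
import math

def logistic(t, K=690_000, r=0.5, t0=12.0):
    """Cumulative count at hour t; K = upper limit, t0 = inflection point."""
    return K / (1 + math.exp(-r * (t - t0)))

assert logistic(12.0) == 690_000 / 2        # half the limit at the inflection
assert logistic(40.0) > 0.999 * 690_000     # saturation towards K
```

Fitting such a curve to the running sums would make the comparison of growth rates and saturation points between the two hashtags quantitative rather than visual.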

Figure 5.3: Running sum of the number of collected tweets containing the hashtag illridewithyou. Reference lines mark minima and maxima of dates and running sum.

Figure 5.4: Running sum of new accounts using the hashtag sydneysiege. Reference lines mark upper limits of the data analysed for this study.

Figure 5.5: Running sum of new accounts using the hashtag illridewithyou. Reference lines mark upper limits of the data analysed for this study.

To reconstruct the spread of the hashtags, it is also necessary to have information about the follow connections of these accounts. This was done for the first 10 000 accounts that tweeted the respective hashtags. The Twitter handle, the account ID, and the account creation date of the accounts they were following (documented as their ‘followings’ or ‘friends’ from here on) were collected between the end of May and the beginning of June 2015 (Python scripts available on GitHub: https://github.com/FlxVctr/PhDcode/blob/master/get_friends_of_useridlist.py and https://github.com/FlxVctr/PhDcode/blob/master/get_friends_of_userlist.py). While the time gap between the event and the collection of the follower networks is regrettable, and suggests that future research could be improved by starting this collection at the same time as retrieving the tweets during acute events, it is acceptable within the scope of this study: 1) this study focuses on exploring methods more than on results related to the events themselves; and 2) it is a reasonable assumption that the dominant structures of follower networks do not change quickly. Unfortunately, the rate limits of the Twitter API for collecting followings are restrictive, and collecting the followings of all accounts would have taken several months. However, as visible in fig. 5.4 and fig. 5.5, the first 10 000 accounts are enough to analyse the early phases of the diffusion process gathering speed, as can be seen from the curvature of the growth lines.



5.3.2 Brexit Repeat Referendum

5.3.2.1 Event

The second diffusion event analysed in this study took place around the referendum held under the European Union Referendum Act (commonly dubbed the ‘Brexit referendum’) on the question of whether the United Kingdom (UK) should leave or remain in the European Union (EU). After the result – that a small majority of 51.89% had voted to leave the EU12 – was announced on the morning of 24 June 2016 at 7:20 British Summer Time (BST) (UTC+1), a hitherto rather dormant petition13 became so popular that the traffic created by the signees temporarily crashed the official petition website of the British parliament. Slow-downs and crashes were reported (especially in the morning), and temporarily rendered the petition completely inaccessible.14 Nevertheless, the threshold of 100 000 signatures needed for consideration for debate in parliament was reached before noon. By its closure, the petition had attracted over 4 million signees. The petition was originally started by a Brexit supporter at a time when polls suggested that the Remain faction would be successful.15 It asked that the referendum be repeated if voter turnout was less than 75%, and if the majority vote to remain or leave was less than 60%. Both scenarios eventuated. However, after debate in parliament, the petition was rejected by the British government.

5.3.2.2 Data Description

The dataset for this analysis was collected by members of the Oxford Internet Institute as part of a collaborative project with Scott Hale. It is part of a larger collection of tweets containing links to the parliament’s petition website, retrieved using the Twitter Search API. Despite its limitations, use of the Search API was necessary because, at the time of collection, it was not possible to track all links to websites within a domain

12 https://en.wikipedia.org/wiki/United_Kingdom_European_Union_membership_referendum,_2016#Result (retrieved 16.12.2017)
13 https://petition.parliament.uk/petitions/131215
14 See, e.g., https://www.independent.co.uk/news/uk/brexit-petition-for-second-eu-referendum-sopopular-the-government-sites-crashing-a7099996.html; and http://metro.co.uk/2016/06/24/petitioncalling-for-a-second-eu-referendum-crashes-because-its-so-popular-5964230/
15 https://en.wikipedia.org/wiki/United_Kingdom_European_Union_membership_referendum,_2016 (retrieved 14.12.2017)


(i.e., in this case, all links containing ‘petitions.parliament.uk’) via the Streaming API, which did not support partial matches of website addresses.

Figure 5.6: Running sum of all collected tweets containing the link to the petition. Reference lines mark minima and maxima of dates and running sum. Vertical axis is logarithmic to enhance visibility of first tweets on this scale. One dot represents one hour.

As can be seen in fig. 5.6, the collection starts with a single tweet on 27 May 2016 at 17:00 UTC, four days after the petition was started. The petition then apparently lay completely dormant on Twitter until an abrupt explosion of attention on 24 June, starting at 04:00 UTC – interestingly, two hours before the Brexit result was officially announced – with the first collected tweet indicating that the petition had gained 25 signatures.16 The tweet volume then plateaus within a couple of days at around 63 000 tweets by around 51 000 accounts. As shown in fig. 5.7, around 17 000 of these tweets (i.e., almost one third) were recorded on 24 June UTC. Compared to the two other datasets described in sec. 5.3.1.2, it is also apparent that tweeting about this event mainly took place in one single time zone, as there is a clear drop in the speed of diffusion during the night.

16 https://twitter.com/user/status/746199614982754306



Figure 5.7: Running sum of collected tweets containing the link to the petition recorded during the week after 24 June 2016.

Figure 5.8: Running sum of collected tweets containing the link to the petition recorded on 24 June 2016. Reference lines mark limits of times between which no tweets have been recorded.

Fig. 5.8, showing the cumulative sum of tweets on the day after the Brexit referendum, exhibits three time periods, each lasting around 20 to 30 minutes, during


which no tweets were recorded. Whether these periods of inactivity were caused by technical problems, Twitter rate limits, or the petition website being unreachable is not clear. However, the existence of social sharing buttons on the petition website itself (fig. 5.9), and the fact that the first recorded gap occurred at around the time when complete outages of the website were reported, suggest – in line with the study’s findings presented in sec. 5.4 – that the website itself was the main driver behind this diffusion event. This would mean that these periods of inactivity are, indeed, related to outages of the website. This has to remain an assumption, however, and the effect of possibly missing data on our results is discussed towards the end of this chapter. Unlike for the two datasets above, the followings of all accounts that tweeted and were found in the collected dataset could be collected, which was possible due to the considerably lower number of accounts. This data collection took place from mid-August until mid-September 2016.

Figure 5.9: Screenshot of share buttons on the petition website

5.4 Analysis

The three cases described above exhibit qualitative differences in their origin and their spread via social media: the petition became popular largely outside of Twitter; news about the Sydney Siege spread rapidly on Twitter and through other media (explainable by the fact that it was a surprising, life-threatening event in the immediate vicinity of major TV studios); and #illridewithyou constituted a native Twitter development (that should not exhibit major outside influences in its early stages of diffusion). Do these differences, however, lead to differences in network structures that are observable, intuitively understandable, and that quantifiably verify what seems qualitatively obvious? Can we confirm the hypothesis that items that inherently carry some social risk for the sharing actor exhibit more complex contagion? Does a Twitter-born, opinionated, ‘grassroots’ hashtag (#illridewithyou) exhibit higher structural virality, a stronger impact of single actors, and a higher speed of spread – that is, higher ‘virality’ – than a purely descriptive hashtag regarding the same topic that is influenced from outside the platform (#sydneysiege)?

This section describes the methods applied, and the corresponding results that answer these questions in the affirmative. Sec. 5.4.1 is an exception, as it does not apply network science methods. However, the use of the metric of new and returning users – a common concept in website analytics – provides the possibility of triangulating and better understanding the results described in the following sections. Sec. 5.4.2.1 describes how the data has been processed to provide us with a diffusion tree network, representing the most likely paths of the spreading hashtags, and the links from account to account. The analysis of the resulting networks is described in subsequent sections. However, the diffusion trees are simplifications in their underlying assumption that a single exposure to an item leads to its use. Therefore, in sec. 5.4.3, this assumption is dropped, and the analysis of two other networks – an exposure and an influence network regarding exposure times and intensity – is described. For a proper understanding of the results reported, it is necessary to discuss the methods that led to them. The measures used exhibit a certain complexity that creates a more obvious need for qualitative interpretation than, for example, simple descriptive statistics about aggregate figures. Therefore, the methods are discussed alongside the results to help with their interpretation, and to turn them into findings about the cases examined. A summary of the results, and an interpretation of the virality of the cases, can be found in sec. 5.4.4. After this section, the discussion in sec. 5.5 reflects on the employed methods, and the implications of this study for media and communication theory. Sec. 5.6 provides an overview of further possible and necessary analysis to gain a better understanding of these and other cases of information, opinion, and behaviour diffusion on social media.

17 Source: https://web.archive.org/web/20160624140735/https://petition.parliament.uk/petitions/131215, retrieved on 17.12.2017


5.4.1 New and Returning Accounts over Time

5.4.1.1 Method and Results

To gain a first intuitive understanding of the diffusion dynamics at play in all three cases, I chose to fall back on a common concept in website analytics: new and returning users. Reinterpreted for sharing on Twitter, this concept translates to accounts that share an item for the first time (‘new’), and accounts that re-share an item (‘returning’).
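This reinterpretation is straightforward to operationalise. The sketch below (with hypothetical toy data, not the analysis script used in the study) classifies each tweet’s account as new or returning per minute, and smooths a series with a trailing moving average of the kind used for the plots:

```python
from collections import defaultdict

def new_vs_returning_per_minute(tweets):
    """Per-minute counts of new and returning accounts.

    `tweets`: chronologically sorted list of (minute, account_id) pairs.
    Returns {minute: (new, returning)}.
    """
    seen = set()
    counts = defaultdict(lambda: [0, 0])
    for minute, account in tweets:
        if account in seen:
            counts[minute][1] += 1  # returning
        else:
            seen.add(account)
            counts[minute][0] += 1  # new
    return {minute: tuple(pair) for minute, pair in counts.items()}

def moving_average(series, window=60):
    """Trailing moving average over up to `window` entries."""
    return [sum(series[max(0, i + 1 - window):i + 1]) / min(i + 1, window)
            for i in range(len(series))]

# Hypothetical data: (minute, account) pairs.
tweets = [(0, "a"), (0, "b"), (1, "a"), (1, "c"), (2, "c")]
per_minute = new_vs_returning_per_minute(tweets)
```

In the study itself the window spans 60 minutes; the toy call below uses a window of 2 only so the arithmetic is easy to follow.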

Figure 5.10: Moving average over 60 minutes of new vs returning users using the hashtag sydneysiege per minute. Reference line marks end of data included for analysis from next section on.

Fig. 5.10, fig. 5.11, and fig. 5.12 show the running average over 60 minutes of the number of returning and new users per minute. Compared to the sharing of the two hashtags, the shares of the petition link show a striking difference on this measure. While for #illridewithyou and #sydneysiege the running averages of returning and new users remain comparably close over time, the petition link was shared only once by the vast majority of users. This seems intuitively understandable, as repeated sharing of a link might be considered ‘spammy’. Moreover, it sits well with the likely user story of clicking the link, signing the petition, and using the social share buttons on the website, without then spending further time on Twitter. A closer look at the sharing of the hashtags also reveals some differences. New



Figure 5.11: Moving average over 60 minutes of new vs returning users using the hashtag illridewithyou per minute. Reference line marks end of data included for analysis from next section on.

Figure 5.12: Moving average over 60 minutes of new vs returning users tweeting the link to the petition per minute.

accounts using #illridewithyou peak during the first wave of diffusion at an average of ca. 450 per minute – a value that #sydneysiege does not reach even during its pronounced peak in the early morning hours of 16 December, coinciding with the time when the attacker was shot by police.


Overall, while in both cases the start of the first wave of diffusion is characterised by a slightly higher number of new than returning users – as one would expect for an issue that receives steeply growing attention – this ratio reverses for #sydneysiege: the number of accounts remaining active in using the #sydneysiege hashtag stays consistently above the number of new users joining; for #illridewithyou, it is the reverse.

5.4.1.2 Discussion

On the one hand, one could interpret this as evidence that the mention of the hashtag #illridewithyou was driven by contagion into new groups, while #sydneysiege was more continuously used by the same accounts. On the other hand, this could also simply be evidence that #illridewithyou was tweeted once by most accounts and then left alone, while #sydneysiege showed a similar growth but was more ‘sticky’ (i.e., was driving an ongoing conversation). Therefore, while giving a good overview and background for interpreting further results, these numbers cannot give a full picture of the contagion process, and a deeper understanding of the virality of these three items is needed. As mentioned, the ratio of new and returning accounts is not a network measure. However, it could be interpreted as a measure of some notion of virality. While, after the initial stages of contagion, there is on average a constantly lower number of new accounts using the hashtag #sydneysiege compared to returning accounts, the opposite is true for #illridewithyou. And, in the case of the link to the petition, the number of new accounts clearly plays the dominant role. This might lead to the conclusion that, relative to the total size of the contagion, the petition attracted more new users and could, therefore, be seen as more viral, despite its lower popularity as expressed by total numbers. However, it is necessary to take into account the fact that the ‘nature’ of a link is different to that of a hashtag. A hashtag is intended to be reused to create some sort of communication channel, reminiscent of channels on Internet Relay Chat (IRC). To post a link more than once bears the danger that it will quickly be regarded as annoying. Therefore, this result for the petition surfaces the qualitative differences between a link and a hashtag. If the assumption of a reluctance to share links more than once holds true in general,



it would mean that a link per se would find it harder to gain attention via organic sharing than a hashtag; this is because it is more difficult to make a link visible in the timelines of the platform users. The results for the two hashtags are more comparable – not only because they are both hashtags, but also because they became popular at the same time, were caused by the same event, and reached a comparable audience. Here too, however, the caveat explained above remains valid: a higher number of new users compared to old users might just be a sign that the hashtag is less ‘sticky’ (i.e., less likely to drive an ongoing conversation while still attracting a comparable number of new, one-time users). In summary, the measure of new and returning accounts seems simple and easy to understand at first sight, but might mislead an analyst into making quick, simple interpretations. The reduction of a complex contagion process to a time series of averaged aggregate numbers omits too many crucial dimensions to tell the story on its own. However, it does add an important angle to the further analysis, and helps with an understanding of the following results.

5.4.2 Analysis of Diffusion Trees

The data collected (as described in sec. 5.3) allowed the reconstruction of diffusion tree networks for all three cases. Their construction and analysis is explained in this section. While still a simplification, this approach proved helpful in better understanding the dynamics of diffusion, and the limitations of the data. It also added the necessary complexity to the analysis of the contagion processes at hand.

5.4.2.1 Reconstruction of Diffusion Tree Network

At least for the first two cases, the chronological timeline, determined by the followings of users, was arguably the main channel of diffusion on Twitter. The Sydney Siege occurred before Twitter introduced a timeline (sorted by a Facebook-style algorithm) in February 2016 for all users that did not opt out. The chronological timeline remains the main channel for users who disabled the sorted timeline, and even the sorted timeline still resembles a chronological feed of tweets posted by followed accounts. Therefore, it


was necessary to acquire information about how the items in question spread via this channel. Goel et al. (2015), through elevated API access, could work with the complete dataset necessary; this study, like most others, could not. Romero et al. (2011) solved this problem by using @-mentions as a proxy for attention between users. The assumption of a possible contagion along links between accounts that mentioned each other a minimum number of times made it possible to investigate networks orders of magnitude bigger than ours. However, the approach by Romero et al. (2011) necessarily produces false negatives, especially for accounts not using Twitter in a conversational way (i.e., accounts that rarely mention others). Furthermore, the assumption that users would not pay attention to accounts they are following, simply because they happen not to interact with these accounts through @-mentions, seems quite a bold one. Therefore, this study collected the following networks of the first 10 000 users active in the diffusion events for #sydneysiege and #illridewithyou, and for all of the ca. 50 000 users in our dataset for the petition (see sec. 5.3). However, due to the collection of this data a few months after the events (see sec. 5.3), some limitations apply:

• It is not possible to know if there was, at the time, a follow connection between two accounts that no longer exists, leading to false negatives.
• Accounts deleted in the meantime, especially bots or spam accounts, are contained in our dataset, but their follow connections can no longer be retrieved, also leading to false negatives.
• Connections made after the event will be in the dataset, leading to false positives.

The first two of these limitations cannot be effectively addressed. To minimise the last problem, a lower bound for the follow date of every connection has been determined, using an approach described by Bruns & Woodford (2014).
This makes use of the fact that the Twitter API returns the followers and followings of an account in reverse chronological order; that is, from the latest follow to the first. Combining this with the openly accessible creation dates of Twitter accounts, it is possible to determine a lower bound for the date at which one account A followed another account B: the latest creation date among all accounts that followed account B before account A did. These accounts must already have existed when A’s follow occurred, so A cannot have followed B earlier than the latest of their creation dates.18
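This logic can be implemented as a single backwards pass over a follower list. The sketch below is illustrative only (it uses hypothetical years as creation dates and is not the notebook linked in the footnote):

```python
def follow_date_lower_bounds(followers_newest_first, creation_dates):
    """Lower-bound the follow dates of B's followers.

    `followers_newest_first`: B's follower IDs as returned by the Twitter
    API, most recent follower first. `creation_dates`: account ID ->
    creation date. An account must have existed before it could follow B,
    so each follower's follow date is at least the latest creation date
    among all accounts that followed B before it (i.e. that appear later
    in the reverse-chronological list).
    """
    bounds = {}
    latest_creation = None
    # Walk from the oldest follower to the newest, carrying the latest
    # creation date seen so far as the lower bound.
    for account in reversed(followers_newest_first):
        bounds[account] = latest_creation  # None for the oldest follower
        created = creation_dates[account]
        if latest_creation is None or created > latest_creation:
            latest_creation = created
    return bounds

# Hypothetical example: follower 'c' followed most recently.
followers = ["c", "b", "a"]          # newest follower first
created = {"a": 2010, "b": 2014, "c": 2012}
bounds = follow_date_lower_bounds(followers, created)
```

Here ‘c’ must have followed no earlier than 2014, because ‘b’ (created 2014) followed before it; any edge whose lower bound falls after an event can thus be excluded as a false positive.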

Figure 5.13: Depiction of Graph Database Schema for Twitter Data

This data was imported into the graph database Neo4j (Robinson et al., 2013), and merged with the tweet data, following the schema depicted in fig. 5.13:

• An account record can be linked to another account via a ‘FOLLOWS’ connection.
• An account can have ‘TWEETED’ a tweet.
• A tweet optionally ‘RETWEETS’, ‘REPLIES’ to, or ‘QUOTES’ another tweet.

All actions were timestamped, the follow connection with the lower bound of the follow date. Quoted tweets were only introduced in 2015 and are, therefore, absent from the #sydneysiege and #illridewithyou cases. The next step was to determine the most likely path the contagion took, following a similar approach to that described by Goel et al. (2015, p. 15), who estimated the accuracy of their approach to be 95%. Both approaches try to find the most likely source account for every ‘adopter’ of the hashtag or link, in several steps19:

18 A Jupyter notebook with the Python code used can be found here: https://github.com/FlxVctr/PhD-code/blob/master/determine_followed_after_date.ipynb
19 A Jupyter notebook with the Python code used can be found here: https://github.com/FlxVctr/PhD-code/blob/master/determine_followed_after_date.ipynb


1. For every account A, retrieve its first tweet T containing the hashtag or link.
2. Get information about whether it is a retweet, a quote, or a reply to another tweet, and retrieve the account B that tweeted this other tweet.
3. For A, retrieve all its followings that existed before T was tweeted.
4. If T is a retweet of tweet S:
   I. If the retweeted account B is followed by A: B is the source account.
   II. If B is not followed by A:
      i. Check whether S was retweeted before by some account C, whom A followed.
      ii. If yes: the latest retweeter C is the source.
      iii. Otherwise: B is the source.
5. If T is a quote of tweet S: proceed as with retweets from Step 4.
6. If T is a reply to, or mentioned, account B:
   I. If B has tweeted the hashtag or link before:
      i. If B is followed by A: B is the source account.
      ii. If B is not followed by A:
         a. Check whether any account C followed by A has mentioned, retweeted, or quoted B before.
         b. If yes: the account C that made the last of these mentions, retweets, or quotes is the source.
         c. Otherwise: B is the source.
7. If T is none of the above types of tweets (i.e., an original tweet):
   I. Try to find the most recent tweet containing the hashtag or link before T by any account B followed by A.
   II. If found: B is the source.
8. If no source can be determined: A is a ‘root’ account (i.e., we assume it has introduced the hashtag or link independently from the mechanisms above, perhaps due to exposure to the news media, Twitter’s trending topics, or the petition website).



In short, the assumption is that the timeline, made up of tweets by the accounts that the account in question follows, is the most important mechanism of diffusion. For every source node found, a directed adoption link from the adopting node to the source node was created, resulting in a tree network structure.
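A heavily simplified sketch of this attribution logic, covering only the retweet and original-tweet cases with hypothetical data structures (the full procedure, as listed above, also handles quotes and replies, and restricts the fallback search to retweeters of the same tweet), could look as follows:

```python
def most_likely_source(tweet, followings, prior_tweets):
    """Simplified source attribution for one adopting account.

    `tweet`: dict with 'account' and optionally 'retweeted_account'.
    `followings`: set of accounts the adopter followed before tweeting.
    `prior_tweets`: chronological list of (time, account) pairs for
    earlier tweets containing the item. Returns the inferred source
    account, or None for a 'root' adoption.
    """
    retweeted = tweet.get("retweeted_account")
    if retweeted is not None:
        if retweeted in followings:
            return retweeted
        # Fall back to the latest earlier tweeter the adopter followed.
        for _, account in reversed(prior_tweets):
            if account in followings:
                return account
        return retweeted
    # Original tweet: latest earlier tweeter among the followings, if any.
    for _, account in reversed(prior_tweets):
        if account in followings:
            return account
    return None  # 'root' account


prior = [(1, "x"), (2, "y"), (3, "z")]
source = most_likely_source({"account": "a"}, {"y"}, prior)
```

In this toy call the adopter follows only ‘y’, so the most recent prior tweet by ‘y’ is taken as the source; with no candidate at all, the adopter becomes a root node.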

5.4.2.2 Connected Components

Figure 5.14: Force-directed visualisation of the diffusion tree network of the hashtag illridewithyou for the first 10 000 accounts using it, coloured by weakly-connected components

Method and results

The diffusion tree network constructed through this process has been visualised for all three cases, as the basis for an exploratory analysis. The visualisations in fig. 5.14, fig. 5.16, and fig. 5.18 show the diffusion trees for #illridewithyou, #sydneysiege, and the petition link, respectively. (For reasons of comparability, only the first 10 000 users tweeting the link have been visualised.) These visualisation layouts were determined with Gephi’s Force Atlas 2 layout (as described in Bastian et al. (2009)), which, due to its interactivity, is especially suited for exploratory visualisations. Visual inspection and comparison of these network visualisations allowed the building of hypotheses about the structure of the three diffusion trees, with the known qualitative differences between the underlying events in mind. The first visually obvious


Figure 5.15: Percentages covered by the largest weakly-connected components of the diffusion tree network for the first 10 000 accounts tweeting the hashtag illridewithyou

Figure 5.16: Force-directed visualisation of the diffusion tree network of the hashtag sydneysiege for the first 10 000 accounts using it, coloured by weakly-connected components


Figure 5.17: Percentages covered by the largest weakly-connected components of the diffusion tree network for the first 10 000 accounts tweeting the hashtag sydneysiege

Figure 5.18: Force-directed visualisation of the diffusion tree network of the petition link for the first 10 000 accounts tweeting it, coloured by weakly-connected components


Figure 5.19: Percentages covered by the largest weakly-connected components of the diffusion tree network for the first 10 000 accounts tweeting the link to the petition

difference was the relative size of the weakly-connected components.20 This finding could be confirmed through further analysis, as seen in fig. 5.15, fig. 5.17, and fig. 5.19. For #illridewithyou, the five biggest components make up more than 86% of the network, with the largest component containing over three quarters of all diffusion paths, meaning that we can trace back more than 77% of all shares to one starting tweet. Indeed, the root account of this tree is the account that posted the first tweet (as quoted above in sec. 5.3.1.1). For #sydneysiege, the two biggest components still make up more than one quarter each, with the five biggest components accounting for more than 71%. For the link to the petition, however, the five biggest components comprise less than 18% of the network.

Discussion

The analysis of the weakly-connected components in a diffusion tree network is appealing because, compared to other network measures, it is fairly easy to understand and, at the same time, applicable to very large datasets due to its computational simplicity.

20 Weakly-connected components, as opposed to strongly-connected components, ignore the direction of edges.
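To illustrate this computational simplicity: weakly-connected components can be found with a plain breadth-first search over an undirected view of the edges, without any specialised library. The edges below are a hypothetical toy diffusion tree, not data from the study:

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Weakly-connected components of a directed graph via BFS.

    Edge directions are ignored. Returns node sets, largest first.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    unvisited = set(neighbours)
    components = []
    while unvisited:
        queue = deque([unvisited.pop()])
        component = set(queue)
        while queue:
            node = queue.popleft()
            for nb in neighbours[node]:
                if nb in unvisited:
                    unvisited.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical diffusion tree: adoption edges point from adopter to source.
edges = [("b", "a"), ("c", "a"), ("d", "b"), ("f", "e")]
components = weakly_connected_components(edges)
largest_share = len(components[0]) / sum(len(c) for c in components)
```

Each node and edge is touched only once, so the procedure scales linearly with the size of the diffusion tree, which is what makes the measure usable for very large datasets.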


The number of connected components is effectively the number of root nodes (i.e., the number of entry points of an item to the platform), corresponding to the number of nodes with an in- or out-degree (depending on the definition underlying the constructed network) equal to zero. However, to make this number comparable across different events, it is of course necessary to normalise it by its maximum, the number of nodes N, leading to a measure of connected components n per node:

\nu = \frac{n}{N} \qquad (5.1)
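Eq. 5.1 is trivial to compute; the toy figures below (hypothetical, not the study’s values) merely illustrate why normalising by N matters when comparing collections of different sizes:

```python
def component_ratio(n_components, n_nodes):
    """nu = n / N: number of connected components per node (cf. eq. 5.1)."""
    return n_components / n_nodes

# Hypothetical figures: the same absolute number of components implies
# very different outside influence for collections of different sizes.
nu_small_collection = component_ratio(500, 10_000)
nu_large_collection = component_ratio(500, 100_000)
```

With identical component counts, the smaller collection yields a ten times higher ν, indicating proportionally more entry points from outside the platform.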

Defined as such, ν can be indicative of an important property of the contagion. A greater influence from channels outside of the platform(s) from which the data has been collected, will lead to a higher ν. This will even hold true if data has been collected for several platforms, and we are analysing a multilayer network. Caution is advised, however, if one contagion happens markedly faster than the other: If data is collected only for the initial stages of the contagion event, a lower ν can also simply indicate that the speed of growth of the dominant component was high. Furthermore, this measure will not be very robust in the case of missing data, as each missing link in the diffusion tree will appear as a new component. As the analysis in this study only included the first 10 000 accounts that posted the items in question, a normalisation is not necessary. The non-normalised number showed the highest number of components for the link to the petition, and the lowest for the hashtag #illridewithyou. Assuming that our data is reasonably complete, and taking into account that all three items reached their maximum intensity within comparable timeframes, this indicates the highest influence from outside of Twitter for the petition links. This makes sense, not only because hashtags are more seldom used outside Twitter than links, but also because it is a reasonable assumption that many original tweets about the petition were posted via the social share buttons on the petition website (fig. 5.9), by people who found the petition through other channels. Furthermore, the fact that, according to ν, #sydneysiege has more outside sources than #illridewithyou, is also explainable by qualitative knowledge about the events: because #sydneysiege quickly became the standard hashtag to tweet about the event, it was also used by news media outlets (e.g., on their websites) already in the


early stages of diffusion. Not only might this have triggered individual users to tweet with this hashtag independently of the accounts they follow; Twitter accounts used by news media might also not be following or retweeting each other if they belong to competing publishers. This would lead to a contagion of the hashtag through channels other than Twitter (e.g., the editorial boardroom). #illridewithyou, however, was a truly organically Twitter-grown hashtag in its early stages, before news media took it up as having gone ‘viral’. Therefore, the result that its diffusion tree network has the smallest number of connected components is conclusive.

Besides their number, the relative size of the components provides insights into the dominance of the biggest components, and can easily be transformed into a centralisation measure. For example, this could be the size N_m of the biggest component, compared to the average of the sizes N_i of all other components i ≠ m. The ‘component centralisation’ Γ would then be calculated as

\Gamma = \frac{N_m}{\sum_{i \neq m} N_i / (n - 1)} \qquad (5.2)

with n being the total number of components. This is also generalisable to take the k largest components into consideration:

\Gamma_k = \frac{\sum_{i=1 \dots k} N_i}{\sum_{i>k} N_i / (n - k)} \qquad (5.3)

with N_1 to N_k being the number of nodes in the k largest components.

Table 5.1: Component centralisations Γ and Γ2 for the diffusion trees of illridewithyou, sydneysiege, and the petition link

      #illridewithyou   #sydneysiege   petition
Γ           606              97           206
Γ2          736             310           301

The results of this calculation for our cases can be found in tbl. 5.1. As expected, #illridewithyou shows the highest centralisation, with one component covering more than three quarters of the diffusion tree, 606 times larger than the average component. The diffusion tree of #sydneysiege exhibits a component centralisation that – at 97 –


is six times lower, and even lower than that of the link to the petition. This reflects the fact that the contagion tree of #sydneysiege contains a second component of comparable size to the largest one. Γ2 confirms this: #sydneysiege surpasses the petition link once the two largest components together are compared to the average of the rest. In this case, these results were already obvious before calculation. However, after validating it with more datasets, and inspecting its behaviour in edge cases, the component centralisation score should provide, in combination with the number of components, a useful metric. It could be used to assess outside influence, and the dominance of the largest cascade, if a bird’s-eye view of a large number of diffusion events is needed (e.g., to classify them).

5.4.2.3 Closeness Centrality Distributions

Method and Results

A comparative profiling of the #sydneysiege and #illridewithyou diffusion trees inspired a closer look at the behaviour of harmonic closeness centrality to investigate the expected differences in the network properties of the diffusion trees. The analysis of a set of common network measures21 revealed that, while other measures did not show striking differences, the distributions of closeness centrality were notably distinct. However, closeness centrality is not properly defined for graphs that are not fully connected. The closeness centrality for a node i in a network is defined as the inverse of the sum of shortest distances dij from all other nodes j in the network; that is:

C_i = \frac{1}{\sum_{j \neq i} d_{ij}} \qquad (5.4)

or in its normalised form:

C_i = \frac{N - 1}{\sum_{j \neq i} d_{ij}} \qquad (5.5)

with N being the number of nodes in the network. For unconnected nodes, the distance term in the divisor would become infinite and, therefore, the closeness of every node would be zero. However, closeness centrality is often implemented in network analysis software in a way that simply ignores missing links. While this circumvents the problem, and still lets one compare nodes within connected components, it renders the measure useless for comparing nodes in different components. For example, the closeness centrality of two nodes connected only to each other would be 1 for each of them. For another node in a bigger component, it would necessarily be lower; this contradicts the common interpretation of closeness (i.e., the ease with which the node in question can reach all other nodes in the network).

[21] The profiling method implemented in NetworKit proved helpful for this.

CHAPTER 5. STUDY 1: MEASURING COMMUNICATION CASCADES

Harmonic closeness centrality is a way to generalise the common network measure of closeness centrality to networks that are not fully connected. It was introduced independently by Decker (2005) and Rochat (2009), the latter giving it its (now) most common name, and it usually shows a high rank correlation with the standard closeness centrality measure. At the same time, it enables a comparison across components. The harmonic closeness centrality of a node i is defined as the sum of the inverse shortest distances d_{ji} from all other nodes j:

H_i = \sum_{j \neq i} \frac{1}{d_{ji}}    (5.6)

or in its normalised form:

H_i = \frac{1}{N - 1} \sum_{j \neq i} \frac{1}{d_{ji}}    (5.7)

with N being the number of nodes in the network. It has to be noted that both kinds of closeness behave fundamentally differently for directed and undirected networks. It will be instructive to compare both cases for the examples at hand.

In the undirected case, harmonic closeness will be higher for a node with many close neighbours than for a node with only a few close neighbours. At the same time, a node in a larger component will have a higher harmonic closeness than a node in a smaller component with a comparable network structure in its vicinity. In our case (i.e., for a diffusion tree), this property already makes it possible to interpret harmonic closeness as a measure of the closeness of a node to the centre of the largest diffusion events. Figs. 5.20, 5.21, and 5.22 show visualisations of the diffusion trees of all three events, ignoring the direction of the edges. The visualisations use the implementation of the layout algorithm by Hu (2006) in graph-tool (Peixoto, 2017b), and are coloured by the normalised harmonic closeness centrality.

Figure 5.20: Force-directed visualisation of the diffusion tree network of the hashtag illridewithyou for the first 10 000 accounts using it, coloured by the undirected normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)

These visualisations, and an analysis of the distributions of harmonic closeness, provide insights into the relative importance of single clusters or nodes for the whole network. The diffusion tree for #illridewithyou (fig. 5.20) shows that the relative closeness to the epicentres of the diffusion events is high for most nodes and clusters. The distribution, depicted in fig. 5.23, confirms this finding, especially when compared to both other cases. In the case of #sydneysiege (fig. 5.21), there are still many nodes, and about three clusters, that appear to have a medium level of closeness compared to the maximum in the diffusion tree. However, the distribution in fig. 5.24 confirms that the bulk of nodes engaged in #sydneysiege are considerably more distant to the centres of diffusion than their counterparts engaged in #illridewithyou. Finally, the diffusion tree for the link to the petition (fig. 5.22) shows only one cluster with medium closeness compared to the maximum (namely, the largest component). The distribution in fig. 5.25 confirms that, in this case, most nodes are comparably far away from one single dominating diffusion event. Taken together, the harmonic closeness, as calculated for the undirected network, is lower for most nodes for the case considered less ‘viral’ – that is, lower for #sydneysiege than for #illridewithyou – and drastically lower for the link to the petition. An interpretation of these results is given in the discussion below.

Figure 5.21: Force-directed visualisation of the diffusion tree network of the hashtag sydneysiege for the first 10 000 accounts using it, coloured by the undirected normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)

Figure 5.22: Force-directed visualisation of the diffusion tree network of the petition link for the first 10 000 accounts tweeting it, coloured by the undirected normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)

Figure 5.23: Distribution of the normalised harmonic closeness in the undirected diffusion tree of the first 10 000 accounts using the hashtag illridewithyou

Figure 5.24: Distribution of the normalised harmonic closeness in the undirected diffusion tree of the first 10 000 accounts using the hashtag sydneysiege

Figure 5.25: Distribution of the normalised harmonic closeness in the undirected diffusion tree of the first 10 000 accounts tweeting the link to the petition

To calculate harmonic closeness for a directed network, it is important to be aware that different software packages implement different definitions of harmonic closeness. Swapping the indices j and i in eq. 5.4 to eq. 5.7 reverses the perspective: in the ji case, the distance from all other nodes to a target node is considered; in the ij case, the distance to all other nodes from an initial node is considered. Depending on how ‘direction’ is defined in a network, this obviously leads to completely different outcomes. In the case at hand, because of the way we defined the network in sec. 5.4.2.1, links are directed towards the source of the contagious items. Therefore, with the normalised version of harmonic closeness, we consider the distance from all other nodes, to produce a measure of the relative importance of a node for the whole cascade. Following eq. 5.7, a node i's score H_i will increase with the number of nodes j that have directly adopted the item from it. Furthermore, for every node k that adopts the item from such a node j, i will receive 1/2 of the score of a direct impact; for every node l that adopts from k, 1/3; and so on, decreasing with the distance. This behaviour distinguishes harmonic closeness from a simple count of the nodes that participate in the cascade caused by a node. Normalising H_i, by dividing it by the number of nodes N − 1, limits the score to between 0 and 1, and allows an interpretation of the relative importance of a node in the overall network. This normalisation makes the harmonic closeness for directed networks comparable between diffusion events. Figs. 5.26 to 5.31 show network visualisations and histograms of the normalised harmonic closeness for the directed diffusion trees, using the same layouts as in the visualisations for the undirected versions above. The nodes in the network visualisations are sized and coloured according to their harmonic closeness, relative to the maximum harmonic closeness in the network.
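The contribution pattern just described – 1 per direct adopter, 1/2 per adopter at distance two, 1/3 at distance three – can be sketched in plain Python. The breadth-first search below is a minimal illustration of eq. 5.7 for a directed diffusion tree, not the implementation used in the study; the toy cascade and its node names are invented.

```python
from collections import deque, defaultdict

def harmonic_closeness(adopted_from, normalised=True):
    """Directed harmonic closeness H_i = sum_j 1/d_ji on a diffusion
    tree whose edges point from each adopter towards its source.
    adopted_from maps node -> the node it adopted the item from."""
    children = defaultdict(list)  # reversed edges: source -> adopters
    nodes = set()
    for child, parent in adopted_from.items():
        children[parent].append(child)
        nodes.update((child, parent))

    scores = {}
    for i in nodes:
        # BFS over reversed edges yields d_ji for all j that reach i.
        dist, queue, h = {i: 0}, deque([i]), 0.0
        while queue:
            v = queue.popleft()
            for w in children[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    h += 1.0 / dist[w]
                    queue.append(w)
        scores[i] = h / (len(nodes) - 1) if normalised else h
    return scores

# Toy cascade: A seeds the item; B and C adopt from A; D adopts from B.
# A collects 1 + 1 (B, C) + 1/2 (D); the leaves C and D score 0.
scores = harmonic_closeness({"B": "A", "C": "A", "D": "B"})
print(scores)
```

As the paragraph above notes, dividing by N − 1 keeps the scores between 0 and 1, so the values remain comparable between cascades of different sizes.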


Figure 5.26: Force-directed visualisation of the diffusion tree network of the hashtag illridewithyou for the first 10 000 accounts using it; nodes coloured and sized by the directed normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)

Due to the directed interpretation of the network, all nodes that are leaves of the contagion tree (i.e., at the end of their observed cascades) have a harmonic closeness of zero, and the value generally increases the further up in the hierarchy a node is found. However, the first node in a cascade does not necessarily have the highest value. For example, if it is separated by a long chain from a node further down in the tree that spawned many adoptions, its score will still profit from these many adoptions; but, due to its distance from them, its score will be lower than that of the node that actually had this impact.

The visualisation of the directed diffusion tree of #illridewithyou (fig. 5.26) shows one dominant node reaching the maximum of harmonic closeness centrality; one second-order node with a harmonic closeness between a middle value and the maximum; and a few more central nodes in the lower half of the spectrum. The #sydneysiege network (fig. 5.27) shows two nodes close to the maximum, followed by a few more nodes than in the #illridewithyou case that exhibit a value in the middle of the spectrum. Overall, however, the distributions seem comparable. The histograms in fig. 5.29 and fig. 5.30 also show that the shapes of the distributions are comparable, although the values for #illridewithyou tend to be slightly higher. For the petition link, the shape of the distribution is similar, but the absolute values are considerably lower than in both other cases (fig. 5.31).

Figure 5.27: Force-directed visualisation of the diffusion tree network of the hashtag sydneysiege for the first 10 000 accounts using it; nodes coloured and sized by the directed normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)


Figure 5.28: Force-directed visualisation of the diffusion tree network of the petition link for the first 10 000 accounts tweeting it; nodes coloured and sized by the directed normalised harmonic closeness centrality relative to the minima and maxima found in the network (minimum: blue; middle: red; maximum: yellow)

The visualisation (fig. 5.28) reveals that there are fewer nodes with a low value that is still perceivably above zero. This is confirmed by the higher counts towards the low values of harmonic closeness in the histogram. In summary, the distributions of harmonic closeness in the directed case show, as expected, a similar behaviour to the harmonic closeness for the undirected network. Harmonic closeness tends to be lower for the #sydneysiege hashtag, which is regarded as less viral than #illridewithyou, and it shows the lowest typical values for the link to the petition.


Figure 5.29: Distribution of the normalised harmonic closeness in the directed diffusion tree of the first 10 000 accounts using the hashtag illridewithyou

Figure 5.30: Distribution of the normalised harmonic closeness in the directed diffusion tree of the first 10 000 accounts using the hashtag sydneysiege


Figure 5.31: Distribution of the normalised harmonic closeness in the directed diffusion tree of the first 10 000 accounts tweeting the link to the petition

Discussion

While the number and size of connected components provide a sense of the macro-structure of the contagion trees, harmonic closeness centrality offers insights into smaller structures, down to a single node. To interpret harmonic closeness centrality, it is necessary to differentiate between a directed and an undirected interpretation of the network.

In the directed case, the harmonic closeness centrality of a node in a diffusion tree is only influenced by the nodes in the cascade caused by that node. This means that it looks ‘forward in time’ from this node, and assesses the impact that the node had on the tree below. Therefore, it presents an isolated account of the direct and indirect impact of this node, compared to all other nodes. However, because it does not assess or extrapolate the future that has not been recorded, possibly impactful nodes – that just happen to have become active shortly before the data collection ended – will have a score close to zero. As such, the directed harmonic closeness is only suitable for identifying the nodes with the highest impact on the diffusion that has been recorded. In our contagion cases, considering fig. 5.29 to fig. 5.31, and the results discussed in sec. 5.4.2.3, there are more accounts with a high individual importance for the contagion of #illridewithyou than for #sydneysiege, and even fewer for the petition link.

In the undirected case, harmonic closeness centrality carries information about what happened before and after a node became active. As every other node in the same weakly-connected component contributes to the undirected harmonic closeness of a node, the measure reflects, for example, the size of the component that the node is part of. If a node is part of a bigger component, or has many close nodes that directly caused a lot of adoptions, this might indicate that the node is close to a community where the shared item fell on fertile ground. This suggests that undirected harmonic closeness should be suitable for improving predictions of how a cascade will develop, and for identifying regions of the contagion where high activity is more likely. And, indeed, while #sydneysiege was used more often due to the ongoing nature of the event, #illridewithyou constantly reached a very high number of new users, peaking higher than #sydneysiege (see sec. 5.4.1) in the stages after the timeframe analysed. This is in line with the finding that it had the highest harmonic closeness values for the undirected tree (see fig. 5.23). While this remains anecdotal evidence, it is promising enough to be tested with further cases in the future.

In summary, for both harmonic closeness measures, directed and undirected, we observe typically higher values the more ‘viral’ the event. This can be interpreted as a higher importance of single nodes in the directed case, and a greater proximity of nodes to the epicentres of contagion in the undirected case. Both properties match an intuitive understanding of virality.
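The component-size effect described for the undirected case can be illustrated with a small sketch. The code below (plain Python, standard library only; the two star-shaped toy cascades are invented, not taken from the study's data) shows that a hub in a larger component receives a higher normalised undirected harmonic closeness than a structurally similar hub in a smaller component.

```python
from collections import deque, defaultdict

def undirected_harmonic(edges, nodes):
    """Normalised undirected harmonic closeness per eq. 5.7:
    H_i = (1/(N-1)) * sum_j 1/d_ij, unreachable j contributing 0."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = list(nodes)
    scores = {}
    for i in nodes:
        dist, queue, h = {i: 0}, deque([i]), 0.0
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    h += 1.0 / dist[w]
                    queue.append(w)
        scores[i] = h / (len(nodes) - 1)
    return scores

# Two star-shaped cascades of different size in one dataset:
# hub 0 with four leaves, hub 5 with two leaves.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6), (5, 7)]
s = undirected_harmonic(edges, range(8))
print(s[0] > s[5])  # the hub of the larger component scores higher: True
```

The leaves show the same pattern: a leaf of the larger star is closer, in this sense, to a large diffusion event than a leaf of the smaller star.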

5.4.2.4 Structural Virality

Method and Results

Even though structural virality, as proposed by Goel et al. (2015), is an easily understandable network measure with an intuitively useful interpretation, it has its limitations for this study. These limitations are comparable to those encountered with the non-harmonic form of closeness centrality, which is not well defined for diffusion networks that are not fully connected. This section describes and tests some possibilities for circumventing this restriction, and their results.
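For a single connected diffusion tree, structural virality in the sense of Goel et al. (2015) is the mean shortest-path distance over all node pairs. A minimal sketch (plain Python; the star and chain examples are illustrative, not data from the study) shows the measure's intuition: a pure ‘broadcast’ star approaches 2, while a long chain of person-to-person passing scores much higher.

```python
from collections import deque, defaultdict

def structural_virality(edges):
    """Mean shortest-path length over all ordered node pairs of a
    connected, undirected diffusion tree (Goel et al., 2015)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = list(adj)
    total = 0
    for i in nodes:
        # BFS from i gives the distances to all other nodes.
        dist, queue = {i: 0}, deque([i])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    n = len(nodes)
    return total / (n * (n - 1))

# A 10-node 'broadcast' star versus a 10-node chain.
star = [(0, i) for i in range(1, 10)]
chain = [(i, i + 1) for i in range(9)]
print(structural_virality(star))   # 1.8
print(structural_virality(chain))  # ~3.67
```

The restriction to a connected tree is exactly the limitation discussed in this section; the variants below relax it.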


Table 5.2: Structural virality (ignoring impossible paths), average degree, diameter, and number of connected components of the undirected diffusion tree network

                    illridewithyou   sydneysiege   petition
struct. virality    11.901           6.348         2.711
average degree      0.981            0.974         0.807
diameter            32               21            14
conn. components    178              244           2440

The definitional problem with ‘structural virality’, or ‘average shortest path length’, is the same as for ‘closeness centrality’: if two nodes are not connected, the shortest path length between them is undefined. If one follows the convention of defining the path length as infinite, the structural virality diverges. For the diffusion trees constructed here, this would mean that a couple of accounts on Twitter tweeting independently would constitute an infinitely viral event; this is definitely not what was originally intended. By contrast, if non-existent connections are ignored, we might come closer to a useful interpretation, as isolated nodes would simply not count. For this case, the results of taking the average shortest path length between all nodes, while ignoring impossible connections, are found in tbl. 5.2. These results show a similar behaviour to that observed with harmonic closeness centrality in sec. 5.4.2.3. At around 11.9, the average shortest path length of the (undirected) diffusion tree of #illridewithyou is about double that of #sydneysiege (ca. 6.3), and more than four times that of the petition (ca. 2.7). However, as the lengths of shortest paths in a diffusion tree can show a skewed distribution – the longer, the less likely; the shorter, the more likely – an average or median might be misleading, or at least hide some parts of the picture. Therefore, a closer look at the distribution of the shortest path lengths, and at the shortest path lengths of separate components in each diffusion tree, was necessary. Indeed, the lower the average shortest path length, the more skewed the distribution seems to be in our three cases (figs. 5.32, 5.33, 5.34). However, the average in all three cases still provides a good approximation of the typical value in the distribution. A more differentiated view of this effect, and of the anatomy of the contagion cascades, can be obtained by inspecting the distribution of average shortest path lengths per connected component, depending on the size of these components.

Figure 5.32: Histogram of shortest path lengths between any two nodes in the diffusion tree of the hashtag illridewithyou, ignoring non-existent paths.

Figure 5.33: Histogram of shortest path lengths between any two nodes in the diffusion tree of the hashtag sydneysiege, ignoring non-existent paths.

Figure 5.34: Histogram of shortest path lengths between any two nodes in the diffusion tree of the link to the petition, ignoring non-existent paths.

Fig. 5.35 to fig. 5.40 show scatterplots and corresponding boxplots, visualising the distribution of the average shortest path length per component in relation to the component size. Overall, it is clear and expected from the scatterplots that component size correlates positively with average shortest path length. This is explainable in part by the obvious fact that component size puts an upper limit on the maximum possible path lengths. Also, the skewed distribution towards smaller path lengths per component, as best seen in the boxplots, is expected, as larger components and longer path lengths are always less likely than small and short ones. All three cases exhibit a steep rise towards an average shortest path length of 2, with a heightened density of entries around this value, ranging across a broad spectrum of component sizes. This suggests a major role for ‘broadcast’-style networks (i.e., single nodes that directly impact a large number of other nodes): the more nodes that are affected, the closer to 2 the average shortest path length will be (see fig. 5.1). This corresponds well with the observations possible in the network visualisations.

Figure 5.35: Scatter plot of the average shortest path length per component in relation to the logarithm to base 10 of the component size in the diffusion tree of the hashtag illridewithyou.

Figure 5.36: Boxplot showing the distribution of the average shortest path length per connected component of the illridewithyou diffusion tree. The red square marks the mean; the red line marks the median; box ranges from first quartile (Q1) to third quartile (Q3); whiskers placed at value closest to, but within, 1.5 times Q3-Q1 range (IQR) from box; green dots are outliers.

Figure 5.37: Scatter plot of the average shortest path length per component in relation to the logarithm to base 10 of the component size in the diffusion tree of the hashtag sydneysiege.

Figure 5.38: Boxplot showing the distribution of the average shortest path length per connected component of the sydneysiege diffusion tree. The red square marks the mean; the red line marks the median; box ranges from first quartile (Q1) to third quartile (Q3); whiskers placed at value closest to, but within, 1.5 times Q3-Q1 range (IQR) from box; green dots are outliers.

Figure 5.39: Scatter plot of the average shortest path length per component in relation to the logarithm to base 10 of the component size in the diffusion tree of the petition link.

Figure 5.40: Boxplot showing the distribution of the average shortest path length per connected component of the petition link diffusion tree. The red square marks the mean; the red line marks the median; box ranges from first quartile (Q1) to third quartile (Q3); whiskers placed at value closest to, but within, 1.5 times Q3-Q1 range (IQR) from box; green dots are outliers.

Regarding the differences between the cases, it is discernible that, for comparable component sizes, the diffusion of #sydneysiege reached higher structural virality maxima (i.e., maxima of the average shortest path length) than the link to the petition. #illridewithyou and #sydneysiege appear comparable, except for the largest components. The median and mean for #illridewithyou are slightly higher than for #sydneysiege. The difference is more pronounced for the link to the petition which, again, shows the lowest values. Even though the results for the average shortest path length per component are in line with the results for the average shortest path length for the whole network while ignoring impossible paths, the interpretation of both measures raises a question about what is to be called the ‘typical’ or ‘average’ virality of diffusion events. This question is discussed below.

Discussion

Regarding structural virality (i.e., the average shortest path length), there is no obviously ‘correct’ way to circumvent the definitional problem that arises if we are not simply investigating a single connected component, but all contagion trees that have been recorded. Only analysing the largest component would miss a big part of the picture, as should be clear from the results for the petition link. I proposed two possibilities; both have differing interpretations but, in our case, lead to similar results.

The first was to ignore non-existent links. This makes sense, as disconnected components decrease the average shortest path length compared to connected components. It appears to be a practical solution, because it shows clear differences for all three cases, in line with the other results of this study. Measured this way, the network for #illridewithyou is pronouncedly more structurally viral than that for #sydneysiege, which in turn (and again) exhibits a higher structural virality than the network for the petition link. The distributions in fig. 5.32 to fig. 5.34 also show that, at least for these cases, and despite the skew of the distributions, the average is a good descriptive value; this needs to be confirmed with more empirical and theoretical work, however. This approach bears an implicit assumption: that the ‘typical’ structural virality is the average for the complete collected data. Yet, as is obvious from fig. 5.36 to fig. 5.39, the shortest path lengths are located on a broad spectrum in different components.

This inspires the second approach for dealing with disconnected components: if we interpret each of these components as an individual diffusion event, the ‘typical’ structural virality would be the virality of the smaller components, which contain fewer accounts but make up the majority of diffusion events. If we compare these results per component with the averages gained by ignoring non-existent paths, the latter (see tbl. 5.2) are close to the values of the largest components in the scatterplots. This is because the largest components exhibit long path lengths while, at the same time, being made up of the most nodes. Therefore, if every component is considered its own contagion event, the average obtained by simply ignoring missing links is more an outlier than the norm: there are many nodes in few large components with long path lengths, but many components with few nodes and short path lengths. In this case, averages and medians still follow the same trend for the per-component interpretation as for the per-dataset interpretation of structural virality, if comparing the cases analysed. However, the differences become markedly less pronounced. How this works out, and to what extent each interpretation of ‘typicality’ makes sense for other cases, should be the subject of further studies.
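The first approach discussed above – averaging only over pairs that are actually connected – can be sketched as follows (plain Python; the toy network of one chain and two dyads is invented, not data from the study):

```python
from collections import deque, defaultdict

def avg_path_length_ignoring_missing(edges, nodes):
    """Mean shortest-path length over all *connected* ordered pairs:
    the 'ignore impossible paths' variant of structural virality
    (undirected interpretation, as for tbl. 5.2)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    total, pairs = 0, 0
    for i in nodes:
        dist, queue = {i: 0}, deque([i])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable pairs only; isolates add nothing
    return total / pairs if pairs else 0.0

# A five-node chain plus two independent dyads: the dyads pull the
# average down, but the measure stays finite and comparable.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
print(avg_path_length_ignoring_missing(edges, range(9)))
```

The second approach amounts to running the same computation per connected component and inspecting the resulting distribution, as in fig. 5.35 to fig. 5.40.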

5.4.3 Exposure Analysis over Time, Complex and Simple Contagion

The concept of structural virality implicitly assumes a contagion process similar to a disease: one contact is enough for infection. Therefore, our analysis above also implicitly followed this simplification. This is already the case for the reconstruction of the diffusion tree, where we assume that the most recently seen post alone triggers the use of a hashtag or the sharing of a link. This might be a justified simplification in some cases. Here, it also seems to reveal dynamics that might help our understanding of the spread of these items. However, as discussed in sec. 2.5.4.1, it lacks the detail required, for example, to understand the spread of items whose use might involve some social risk. This is undoubtedly the case for the sharing of an opinion-loaded hashtag such as #illridewithyou, which defended Muslims against Islamophobia after a radical Islamist terrorist attack. It might also be the case for sharing a petition against a decision that just over half of the nation voted for. Compared to these risks, the sharing of a hashtag that simply alerts followers to a major breaking news event (in this case, the Sydney Siege) is comparatively free of social risk. Therefore, we would expect differences in the complexity of the contagion processes for these items.

To examine these effects, we have to construct other kinds of diffusion networks that allow for the possibility that more than one tweet has influenced a single account in its adoption of a hashtag or a link. While the diffusion trees discussed above provide a simple way to track the contagion, the assumption that only the last exposure to a hashtag or a link caused the decision to share is not realistic. Furthermore, a closer look at exposure times (sec. 5.4.3.2), the number of exposures (sec. 5.4.3.3), and the influence network derived from the exposure network (sec. 5.4.3.4) provides interesting insights and promising vantage points for further analysis.

5.4.3.1 Exposure Network

First, an exposure network could easily be constructed with a simple query to the graph database described in sec. 5.3. The query is also suitable for documenting the steps undertaken to construct the exposure network in a precise, formulaic notation. The comments in lines starting with // provide further explanation.[22]

// find all accounts that tweeted a tweet in this dataset
MATCH (u:UserName)-[:TWEETED]->(t:Tweet)
WITH DISTINCT u, min(t.timestamp) AS mintime
WHERE mintime IS NOT NULL

// and get their first tweet.
MATCH (u)-[:TWEETED]->(t:Tweet {timestamp: mintime})
WITH DISTINCT u, t

// Then find all accounts they are following that have tweeted
MATCH (u)-[f:FOLLOWS]->(v:UserName)-[:TWEETED]->(t2)
WHERE f.followed_after < t.timestamp
// after they were followed and
AND t2.timestamp < t.timestamp
// who tweeted before our first accounts.

// Then create an EXPOSED_TO relationship
// storing the time difference between exposure and first tweet.
MERGE p=(u)-[:EXPOSED_TO {time_difference: t.timestamp - t2.timestamp}]->(t2);

[22] A good introduction to Cypher, the query language used by Neo4j, can be found at https://neo4j.com/developer/cypher-query-language/

Figure 5.41: Example of a bimodal exposure network of accounts (purple) being exposed to tweets (green)

The result is a bimodal network, with accounts connected to tweets via an exposure link; an example is depicted in fig. 5.41. This allows a more differentiated analysis of the contagion dynamics.

5.4.3.2 Typical Exposure Times Method and Results

Following the literature outlined in sec. 2.5.4.1, some non-

continuous break in the dynamics of the contagion, and therefore in the description parameters – that is, a phase transition towards a behaviour where the spread of the items goes ‘viral’ – was expected. To investigate this possibility regarding the speed of contagion, the time differences between exposure to an item and its first share by an account, were analysed. A method to inspect sudden changes of parameters in a system without knowing

174

CHAPTER 5. STUDY 1: MEASURING COMMUNICATION CASCADES

Figure 5.42: Ranked time differences between exposure to the hashtag illridewithyou and the first own post by an account on a logarithmic scale, showing a discontinuity at around 60 minutes

Figure 5.43: Ranked time differences between exposure to the hashtag sydneysiege and the first own post by an account on a logarithmic scale, showing a discontinuity at around 100 minutes

their exact behaviour is to plot the parameter of interest ranked by its size. Sudden changes in the slope, possibly only visible when scaled accordingly (e.g., logarithmic), indicate a sudden change of the behaviour of the system, and allow the definition of ‘critical values’ of the analysed parameters (cf., use of the same method to discern high and low values without a reference frame by Himelboim et al. (2017)). Indeed, plotting all exposure times for every account and every tweet, ranked by

5.4. ANALYSIS

175

Figure 5.44: Ranked time differences between exposure to the link to the petition and the first own post by an account on a logarithmic scale, showing a discontinuity at around 600 minutes. Filtered to the first 10000 accounts tweeting the link.

their length on a logarithmic scale, reveals a discontinuity, which points to the fact that dynamics must have changed. This is the case for #illridewithyou at an exposure time of around 60 minutes (fig. 5.42); for #sydneysiege at an exposure time of around 100 minutes (fig. 5.43); and for the link to the petition at an exposure time of 600 minutes (fig. 5.44). This indicates a substantial change in the mechanisms of diffusion. However, the times concur simply with the time difference between the diffusion of each event going ‘viral’ (i.e., when the adoption curve started to exhibit an exponential increase) and the end of the time window analysed within the first 10 000 adoptions. Therefore, the few exposure times higher than these values are simply associated with accounts that have possibly been exposed to tweets that were posted before the contagion went ‘viral’, and tweeted at the very end of the analysed timeframe. Before this cut-off, there are some un-explained variations in the slope of the curve for the petition case, which might be interesting to investigate in further research. Furthermore, the curves appear smooth. It is not possible to find an exposure time that might be typical for some changed main dynamic. This indicates that exposure times below these values are caused by similar dynamics, and are comparable in the following analysis. Analysing the time differences between the earliest and the most recent possible exposure of an account to the hashtag or link and the first tweet of an account that used that hashtag or link, proved interesting. As we can see in fig. 5.45 to fig. 5.48,

176

CHAPTER 5. STUDY 1: MEASURING COMMUNICATION CASCADES

Figure 5.45: Heatmap showing the number of accounts having used the hashtag illridewithyou, mapped according to their first and last possible exposure to the hashtag

Figure 5.46: Heatmap showing the number of accounts having used the hashtag sydneysiege, mapped according to their first and last possible exposure to the hashtag

5.4. ANALYSIS


Figure 5.47: Heatmap showing the number of accounts having tweeted the link to the petition, mapped according to their first and last possible exposure to the link, for the first 10 000 accounts only

Figure 5.48: Heatmap showing the number of accounts having tweeted the link to the petition, mapped according to their first and last possible exposure to the link, for all accounts


the observed accounts are divided into two groups in all three cases. The heatmaps map the number of accounts along their first and their last possible exposure before their first post of the hashtag or link. In all three cases, there is a line of maxima along the diagonal, representing a group of accounts that began to tweet after only one exposure (or after multiple exposures in very quick, and therefore unlikely, succession), corresponding with what the literature would call ‘simple contagion’. By contrast, each figure also contains a more or less prominent group of accounts that became active only after more than one exposure; these would, therefore, correspond with a more complex contagion model. In each case, this second group is separated from the diagonal of single-exposure accounts by a region with lower counts. This might, however, be less a sign of user behaviour than the result of systematic constraints and artefacts caused by the data collection. A possible cause for the gap is that it is less likely for an account that eventually becomes active itself to be exposed to the same item twice in a short period of time, and then never again, than to be exposed several times over a much longer period before finally also posting actively. Also, the maxima of exposure time are of the same magnitude as the timeframe between when the diffusion ‘went viral’ in each case and when it reached the cut-off point of our data collection at 10 000 (#illridewithyou, #sydneysiege) or around 50 000 (petition) exposures. The most recent exposure time, however, might be characteristic of the diffusion events. The fact that all events show a local maximum at around 5 to 10 minutes on the simple contagion diagonal hints at typical time spans in these diffusion events; and the fact that the petition link case shows markedly longer minimum exposure times in general, and a much less pronounced region of repeated exposures, points to a different dynamic from both other cases.
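A heatmap of this kind can, in principle, be reproduced with a two-dimensional histogram. The following sketch is only an illustration of the idea behind figs. 5.45 to 5.48, not the actual analysis code; the per-account timestamp arrays are hypothetical stand-ins for the collected data.

```python
import numpy as np

def exposure_heatmap(first_exp_ts, last_exp_ts, first_tweet_ts, n_bins=100):
    """2-D histogram of accounts over (first, last) possible exposure delay.

    Mass on the diagonal (first delay == last delay) corresponds to
    accounts exposed only once before tweeting ('simple contagion');
    off-diagonal mass corresponds to repeated exposures over a longer
    period ('complex contagion').
    """
    first_delay = np.asarray(first_tweet_ts) - np.asarray(first_exp_ts)
    last_delay = np.asarray(first_tweet_ts) - np.asarray(last_exp_ts)
    H, xedges, yedges = np.histogram2d(first_delay, last_delay, bins=n_bins)
    return H, xedges, yedges
```

The matrix `H` then only needs to be rendered with a colour map to yield a heatmap of the kind shown in the figures.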
One of the reasons for these elongated exposure times might be the changes to the Twitter timeline that were implemented in the interim. These changes made the timeline less purely chronological, and offered an algorithmically determined ‘best of’ tweets that a user might have missed.

Discussion

Unfortunately, precise data about which tweets a user had on their screen was not available for this study. Therefore, the lower and upper limits (i.e., the first and the last possible exposure time), under the assumption that users were exposed to


tweets only via the Twitter timeline, were analysed (as reported above). In all three cases, two groups of accounts can be clearly discerned: a group that was exposed only once, or multiple times within a relatively short period of time; and a group that was apparently exposed more often over a longer time span. For the petition link, the separation between the two groups appears less pronounced, and at markedly longer timespans. Meanwhile, the timescales are shorter, and similar to each other, for both hashtags analysed. This indicates that, in general, the time difference between possibly being exposed to the link and ending up tweeting it was around a factor of 10 higher than for the hashtags. This is in line with our findings regarding the exposure counts, and indicates a lower intensity of exposure in the petition case (see below in sec. 5.4.3.3). The slightly shorter characteristic timespans for #illridewithyou are also in line with all other findings that indicate a higher intensity of exposure prior to its use for this hashtag than for #sydneysiege. Regarding the petition link, however, the magnitude of the difference is so large that I assume the timespans have been prolonged by Twitter’s algorithmically sorted timeline. This sorting algorithm, which presents tweets deemed relevant at the top of a user’s timeline even hours after they have been tweeted, had not yet been introduced at the time of the Sydney Siege. This presents further research opportunities: for example, to reverse engineer platform algorithms by systematically comparing datasets from before and after the introduction of the algorithmic timeline, or to study the effects of subsequent changes to this timeline.
Further theoretical and empirical work is necessary to understand and verify the emergence of these two discernible groups in this kind of contagion phenomenon. On the one hand, further work might confirm a separation of accounts into a simple contagion group (accounts that tweet immediately after they have seen a hashtag once) and a complex contagion group (accounts that need more exposures to do so). On the other hand, there are a number of possible systematic issues that might render this finding an artefact. As mentioned in sec. 5.4.3.2, the gap might, for example, be caused by the phenomenon that two exposures immediately following each other are less likely than multiple exposures over a longer timespan.
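The exposure-time distributions examined in this section can be sketched as follows. This is a minimal illustration, assuming hypothetical per-account arrays of the earliest possible exposure and first-tweet timestamps (in seconds); it stands in for, but is not, the code behind the log-scale plots discussed above.

```python
import numpy as np

def log_binned_exposure_times(first_exposure_ts, first_tweet_ts, n_bins=50):
    """Histogram of exposure times (in minutes) with logarithmic bins.

    A sharp drop between adjacent bins on this scale corresponds to the
    kind of discontinuity discussed above for #illridewithyou (~60 min),
    #sydneysiege (~100 min) and the petition link (~600 min).
    """
    # Exposure time: delay between the earliest possible exposure
    # and the account's own first tweet of the item.
    minutes = (np.asarray(first_tweet_ts) - np.asarray(first_exposure_ts)) / 60.0
    minutes = minutes[minutes > 0]  # guard against timestamp artefacts
    bins = np.logspace(np.log10(minutes.min()), np.log10(minutes.max()), n_bins)
    counts, edges = np.histogram(minutes, bins=bins)
    return counts, edges

# Toy data in seconds: four accounts, all first exposed at t = 0.
counts, edges = log_binned_exposure_times([0, 0, 0, 0], [60, 120, 600, 6000], n_bins=10)
print(int(counts.sum()))  # 4: every account falls into a bin
```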


5.4.3.3 Number of Exposures

Method and Result

As data about the exact number of exposures experienced by each account was not available, an upper limit was analysed. Again making use of the exposure network, constructed as described in sec. 5.4.3.1, the out-degree of accounts (i.e., the maximum possible number of exposures according to the available data) was measured, and its distributions analysed.
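As a minimal sketch of this measurement, assuming the exposure network is available as a plain list of hypothetical `(account, tweet)` edges rather than the Neo4j database actually used:

```python
from collections import Counter

def max_exposures(exposure_edges):
    """Upper limit of exposures per account: its out-degree in the
    exposure network, i.e. the number of item tweets that could have
    appeared in its timeline before its own first tweet of the item.
    """
    return Counter(account for account, _tweet in exposure_edges)

edges = [("a", "t1"), ("a", "t2"), ("b", "t1"), ("c", "t3")]
print(sorted(max_exposures(edges).items()))  # [('a', 2), ('b', 1), ('c', 1)]
```

The resulting counts per account can then be aggregated into the cumulative distributions shown in the figures.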

Figure 5.49: Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag illridewithyou before the respective accounts tweet the hashtag. Green line indicates the number of accounts having tweeted it after a single possible exposure (simple contagion count).

Fig. 5.49 to fig. 5.52 show the cumulative distributions of accounts relative to the upper limit of exposures to the respective item in question. Both hashtags, #illridewithyou (fig. 5.49) and #sydneysiege (fig. 5.50), show a similar curve. However, the median for the number of possible exposures is higher for #illridewithyou (9) than for #sydneysiege (7). The number of adoptions after a single exposure (simple contagion count in the figure) shows only marginal differences. Fig. 5.51 and fig. 5.52 show the same diagram for the contagion of the link to the petition, once for the first 10 000 accounts only, and then for all accounts observed. Both show a notably lower median number of possible exposures, 2 and 3 respectively, than for the hashtags. At the same time, fig. 5.51 shows a lower maximum of accounts


Figure 5.50: Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag sydneysiege before the respective accounts tweet the hashtag. Green line indicates the number of accounts having tweeted it after a single possible exposure (simple contagion count).

Figure 5.51: Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition before the respective accounts tweet the link, limited to the first 10 000 accounts. Green line indicates the number of accounts having tweeted it after a single possible exposure (simple contagion count).


Figure 5.52: Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition before the respective accounts tweet the link, for all accounts in the dataset. Green line indicates the number of accounts having tweeted it after a single possible exposure (simple contagion count).

(ca. 5000 instead of around 8000). This reflects the fact that many accounts in this network have not been exposed to the link via any channel considered here. In line with this, the number of accounts having tweeted the link after one single possible exposure is more than twice the number in the hashtag cases (i.e., around 2300 instead of ca. 1000), or almost half of the exposed accounts (compared to about 1 in 8 in the hashtag cases). However, even though these results are in line with all previous results, the number of exposures might be misleading. The number of followings on Twitter is typically heavily skewed, with many accounts following only a few others, compared to the massive following numbers of a small number of ‘heavy users’. Indeed, we can observe an expected positive correlation between the number of followings and the number of exposures, as shown in fig. 5.53 to fig. 5.56. All graphs prominently show the restrictions imposed by Twitter on the number of followings – 2000 accounts initially, and 5000 accounts from October 2015 onwards – unless an account had enough followers (see, e.g., https://www.engadget.com/2015/10/27/twitter-raises-following-limit-to-5000/). More importantly, however, the correlation between the number of friends and exposures seems to show logarithmic behaviour; that is, it needs


Figure 5.53: Heatmap showing the correlation of number of friends (followings) with the number of exposures, in the case of the hashtag illridewithyou

Figure 5.54: Heatmap showing the correlation of number of friends (followings) with the number of exposures, in the case of the hashtag sydneysiege


Figure 5.55: Heatmap showing the correlation of number of friends (followings) with the number of exposures in the case of the link to the petition, for the first 10 000 accounts

Figure 5.56: Heatmap showing the correlation of number of friends (followings) with the number of exposures in the case of the link to the petition, for all accounts in the dataset


an exponentially higher number of friends for a linear (or lower) increase in the number of exposures. The correlation is strongest for #illridewithyou, and weakest for the petition.
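The sub-linear relationship suggested by the heatmaps can be quantified, for example, with a least-squares fit on logarithmic scales. The arrays `friends` and `exposures` below are hypothetical per-account counts, not the study’s data.

```python
import numpy as np

def loglog_slope(friends, exposures):
    """Least-squares slope of log10(exposures) against log10(friends).

    A slope well below 1 is consistent with the pattern described in
    the text: an exponentially higher number of friends yields only a
    linear (or lower) increase in the number of exposures.
    """
    x = np.log10(np.asarray(friends, dtype=float))
    y = np.log10(np.asarray(exposures, dtype=float))
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Toy example: exposures double while the number of friends grows tenfold.
print(round(loglog_slope([10, 100, 1000, 10000], [2, 4, 8, 16]), 2))  # 0.3
```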

Figure 5.57: Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag illridewithyou per followed account before the respective accounts tweet the hashtag, limited to the first 10 000 accounts

Fig. 5.57 to fig. 5.60, corresponding with fig. 5.49 to fig. 5.52, show the cumulative distributions of exposures per friend before an account tweeted the respective item itself. In all three cases, on a logarithmic horizontal axis, the distribution exhibits a sigmoid shape, reminiscent of an activation function in neural networks, or of the cumulative normal distribution. In line with this, fig. 5.61, showing the actual distributions of the logarithm of exposures per friend, exhibits near-normal distributions for #illridewithyou and #sydneysiege. This is particularly interesting, as it makes the application of conventional statistical methods possible. In line with the results above, #illridewithyou exhibits a higher median and mean of these distributions than #sydneysiege; the difference of the means satisfies conventional limits of statistical significance according to Welch’s t-test (fig. 5.61). Both hashtags (figs. 5.57, 5.58) show means and medians an order of magnitude higher than the same measures for the first 10 000 accounts having tweeted the link to the petition (fig. 5.59).
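The test described here can be sketched as follows, assuming hypothetical arrays of exposures per friend for two items; `scipy.stats.ttest_ind` with `equal_var=False` implements Welch’s variant of the t-test.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_peer_pressure(epf_a, epf_b):
    """Welch's t-test on log10(exposures per friend) for two items.

    The log transform is justified by the near-normal shape of these
    distributions; equal_var=False selects Welch's variant, which does
    not assume equal variances in the two samples.
    """
    log_a = np.log10(np.asarray(epf_a, dtype=float))
    log_b = np.log10(np.asarray(epf_b, dtype=float))
    return ttest_ind(log_a, log_b, equal_var=False)

# Toy example with synthetic log-normal 'exposures per friend' samples.
rng = np.random.default_rng(0)
t, p = compare_peer_pressure(rng.lognormal(0.0, 0.5, 400),
                             rng.lognormal(1.0, 0.5, 400))
```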


Figure 5.58: Cumulative distribution of accounts relative to the maximum possible number of exposures to the hashtag sydneysiege per followed account before the respective accounts tweet the hashtag, limited to the first 10 000 accounts

Figure 5.59: Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition per followed account before the respective accounts tweet the link, limited to the first 10 000 accounts


Figure 5.60: Cumulative distribution of accounts relative to the maximum possible number of exposures to the link to the petition per followed account before the respective accounts tweet the link, for all accounts in the dataset

The development of this measure over time also shows some interesting differences between the cases, as can be seen in fig. 5.62 to fig. 5.64. Overall, both hashtags evidently took a shorter time than the petition link to reach the maximum exposures per friend measured, and show a rapid growth dynamic. While #illridewithyou shows a more symmetric distribution in the early stages (fig. 5.62), #sydneysiege is more skewed towards lower values at the beginning of the contagion; this points to a simpler contagion in the early stages (fig. 5.63). This is explainable by the surprise factor of the event – a news factor that evidently increases the success of news on Twitter in general. The diagram for the link to the petition (fig. 5.64) exhibits a constantly lower intensity than in both other cases, with little to no growth over time. Here, it is also evident that the gap in data collection does not lead to discontinuities in the number of exposures.

Discussion

It is important to note that the causalities involved in exposure counts and tweeting behaviours remain unknown after this study: Did accounts post because they had been exposed to more tweets? Were there more exposures because more people


Figure 5.61: Distributions of the logarithm of exposures per friend before tweeting the hashtags illridewithyou and sydneysiege, both resembling a normal distribution; and results of Welch’s t-test regarding significance of the difference of their means

tweeted? Are both numbers mainly determined by underlying community structures? Or was a combination of all three mechanisms at play? Furthermore, we are working with a potential upper limit, not the actual exposure count. With the data collected, it is not possible to determine whether a user actually had a tweet on their screen or not. Nor was any data collected on accounts that had not tweeted the items in question; that is, there is no information available about accounts that might have seen the items but did not act on them. Nevertheless, the results make sense when cross-validated against theory and the remaining findings. With this in mind, the analysis of the upper limit of exposures indicates that accounts typically had a higher number of possible exposures to the hashtag #illridewithyou than to #sydneysiege before they began using the hashtag. The number of accounts that posted the hashtags #sydneysiege and #illridewithyou after a single exposure – corresponding to the concept of simple contagion – is about the same in the


Figure 5.62: Heatmap showing the distribution of infections (adoptions) by exposures per friend over time for the hashtag illridewithyou. Time in seconds of UNIX time (epoch).

Figure 5.63: Heatmap showing the distribution of infections (adoptions) by exposures per friend over time for the hashtag sydneysiege. Time in seconds of UNIX time (epoch).


Figure 5.64: Heatmap showing the distribution of infections (adoptions) by exposures per friend over time for the link to the petition for the first 10 000 infected accounts. Time in seconds of UNIX time (epoch).

observed datasets, however. For the petition link, this number is more than double, while the typical exposure counts for multiple exposures are markedly lower. The finding that the contagion of #illridewithyou is, therefore, ‘more complex’ than that of #sydneysiege, and ‘most simple’ for the link to the petition, still holds true if we factor in the number of accounts an account follows. This qualification is motivated by the assumption that, if an account follows more accounts, it will be harder for any one item to be taken up by this account: each tweet competes with many other tweets for the user’s screen and attention space. The logarithm of the exposures per friend, which we might call ‘peer pressure’, proved the most productive measure for this analysis, as its distribution over the number of accounts that have tweeted one of the hashtags or the link follows a near-normal distribution (fig. 5.61). The resulting cumulative distribution resembles a sigmoid transmission function (fig. 5.57 to fig. 5.60), as is also found in models of neural networks. This supports the interpretation that the parameters of this distribution function for different items should provide us with useful characteristic values, such as the mean and median (the typical number of exposures per friend after which an account picks up


an item); the standard deviation (the uniformity of accounts in their reaction to the item); the skew (do accounts tend to need more or fewer exposures than the typical value?); and so on. All of these characteristic values indicate the virality of an item. They also indicate typical values of peer pressure after which an account picks up an item. If, in further studies, this distribution function also turns out to show a near-normal distribution, its main advantage will be the applicability of common and widely accepted statistical methods.
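Extracting these characteristic values can be sketched as follows, assuming a hypothetical array of exposures per friend at adoption time; the skew is computed here as the third standardised moment.

```python
import numpy as np

def virality_profile(exposures_per_friend):
    """Characteristic values of log10(exposures per friend) at adoption.

    Mean/median: typical 'peer pressure' after which an account picks up
    an item; standard deviation: uniformity of reactions; skew: tendency
    to need more or fewer exposures than the typical value.
    """
    x = np.log10(np.asarray(exposures_per_friend, dtype=float))
    mean, std = x.mean(), x.std()
    skew = np.mean(((x - mean) / std) ** 3)  # third standardised moment
    return {"mean": mean, "median": np.median(x), "std": std, "skew": skew}
```

Comparing these profiles across items would then amount to comparing a handful of scalar parameters rather than whole adoption curves.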

5.4.3.4 Influence Network

Method and Result

Until now, with the exception of the reconstruction of the exposure network, the underlying community structure has been ignored. However, community structure can have a major impact on the spread of an item (see sec. 2.5.4.1). Unfortunately, for this study, the complete follow network data was not available, as, due to time constraints, only the followings (but not the followers) of the first 10 000 accounts were collected. Therefore, we have no information about who might have been exposed to an item but had not shared it. However, the available data made it possible to examine who influenced whom, and how often, before the item was shared. To this end, the exposure network was transformed into a weighted network of ‘influence’ links, following the procedure described by the comments in the following example code in Cypher:

// 1. Find all users u exposed to any tweets
//    and who have tweeted them.
MATCH (u:User)-[e:EXPOSED]->(t:Tweet)
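The Cypher listing is cut off here. As a rough, hypothetical illustration of the transformation it describes (collapsing exposure events into weighted influencer-to-adopter links), the same aggregation can be sketched in plain Python, assuming a list of `(exposed_user, author, timestamp)` exposure events recorded before the exposed user’s first tweet of the item:

```python
from collections import Counter

def influence_network(exposure_events):
    """Collapse exposure events into a weighted influence network.

    Each event records that `exposed` could have seen a tweet by
    `author` before `exposed` first used the item; the weight of the
    edge author -> exposed counts how often that happened.
    """
    return Counter((author, exposed) for exposed, author, _ts in exposure_events)

events = [("u1", "u2", 100), ("u1", "u2", 110), ("u1", "u3", 105)]
print(sorted(influence_network(events).items()))
# [(('u2', 'u1'), 2), (('u3', 'u1'), 1)]
```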
