Evolution of Privacy Loss in Wikipedia
Marian-Andrei Rizoiu∗
NICTA, ANU
Lexing Xie
ANU, NICTA Canberra, Australia
Tiberio Caetano
Ambiata, ANU, UNSW Sydney, Australia
Manuel Cebrian
NICTA Melbourne, Australia
arXiv:1512.03523v3 [cs.SI] 17 Dec 2015
∗Corresponding author: [email protected]
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
WSDM’16, February 22–25, 2016, San Francisco, CA, USA. © 2015 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-3716-8/16/02. DOI: http://dx.doi.org/10.1145/2835776.2835798
ABSTRACT
The cumulative effect of collective online participation has an important and adverse impact on individual privacy. As an online system evolves over time, new digital traces of individual behavior may uncover previously hidden statistical links between an individual’s past actions and her private traits. To quantify this effect, we analyze the evolution of individual privacy loss by studying the edit history of Wikipedia over 13 years, including 117,523 different users performing 188,805,088 edits. We trace each Wikipedia contributor using apparently harmless features, such as the number of edits performed on predefined broad categories in a given time period (e.g. Mathematics, Culture or Nature). We show that even at this unspecific level of behavior description, it is possible to use off-the-shelf machine learning algorithms to uncover usually undisclosed personal traits, such as gender, religion or education. We provide empirical evidence that the prediction accuracy for almost all private traits consistently improves over time. Surprisingly, the prediction performance for users who stopped editing after a given time still improves. The activities performed by new users seem to have contributed more to this effect than additional activities from existing (but still active) users. Insights from this work should help users, system designers, and policy makers understand and make long-term design choices in online content creation systems.
Keywords
online privacy, de-anonymization, temporal loss of privacy.
1.
INTRODUCTION
Privacy is a relatively new concern of modern society [14]. Historically, compromising one’s privacy was a difficult task, mainly achieved through constant physical surveillance, which is costly by nature and easy to thwart. The advent of the online environment has changed the privacy landscape: users of social networking, blogging and microblogging platforms willingly or unwillingly share information with the public and with organizations. The general public is already aware [4, 16] that information inadvertently left online can hurt privacy, and researchers have shown [12] that personal attributes can be predicted from these online behavioral traces. However, the longitudinal change of privacy loss is not well understood: namely, how information collected over several years can compromise privacy, and how the predictability of private attributes evolves.
In this paper, we set out to answer such challenging questions by curating a novel large-scale behavioral trace dataset, and by measuring the predictability of personal traits in a number of ways. We construct a new dataset from all editing activities in and around Wikipedia, the largest encyclopedia to date, collaboratively constructed by hundreds of thousands of users. We use as input each user’s aggregated editing activities in a number of broadly defined content and community categories, and the target outputs are personal traits from Wikipedia badges, i.e., what users choose to disclose on their personal pages. This problem and system setting allows us to make several key observations: (1) We show that Wikipedia editors’ private traits can be inferred using off-the-shelf machine learning algorithms, and that the prediction performance consistently improves over our prediction period from 2007 to 2013. In particular, our results include predicting an individual’s gender, educational status and religious views. Among the different personal attributes, a subset showed high prediction accuracy as measured by the equal-precision-recall metric: the editors’ gender at 0.79, practicing the muslim religion at 0.9 and being jewish at 0.91. (2) We quantify the effect of different features using a temporal measure called information transfer.
We observe that while the marginal utility of newer features decreases over time, new users consistently add information for the prediction tasks. (3) We show that the prediction of private attributes continues to improve for users who exited the system, i.e., stopped editing after 2007. The continued loss of privacy for these users seems to be associated with two quantifiable factors: the information learned from a user’s own activity (or online breadcrumbs) and the activity of other editors.
To the best of our knowledge, this is the first work to quantify longitudinal change in privacy loss, carried out on a large dataset and over more than 13 years. Not only do we show that private traits can be predicted increasingly well with time, we also provide several methods to quantify the value of new information over time and the different sources of information loss: more activity or more users. Our findings suggest that privacy continues to erode for all users, even after one stops publishing data online. Our findings can also help design data storage and retention policies, make users aware of the implications of seemingly harmless online activities, and add to the very lively research topic of online privacy.
2.
RELATED WORK
The value of privacy. Social scientists have been interested in how individuals perceive and value their privacy. Acquisti et al. [2] revealed that the perceived value of privacy, while not entirely arbitrary, is highly malleable. For example, there is a large gap between the amount of money that individuals would accept to disclose private information and the amount of money they would pay to protect it. Furthermore, certain categories of users seem particularly vulnerable to online privacy issues. Young adults have been shown [5] to be as worried about their privacy as older adults, especially concerning giving personal information to businesses, having photos of them uploaded to the internet, and the legislation protecting privacy. On the contrary, Hoofnagle et al. [10] found that young adults tend to expose themselves more, especially on popular social networks, because they are less aware of the risks, less informed about the protection given by law and more prone to social peer pressure. Our work can contribute to the public understanding of privacy risks, by studying public open data with quantifiable outcomes.
Privacy loss and inferring private traits. One definition of Personally Identifiable Information (PII) is private information relating to a person, which can be deductively identified based on the person’s public profile [19]. This is a source of concern particularly in the context of the Open Data effort of governments, in which anonymous datasets are released publicly after removing private attributes such as name and contact information. The literature shows many applications in which private information, which was never intended to be publicly released, can be inferred from apparently harmless data. In one notorious example, the medical condition of an American politician was inferred from anonymized medical records released to the public [23]. Researchers found that de-anonymization can be carried out at large scale. The 2011 IJCNN Social Network Challenge was won by de-anonymizing the identity of the Flickr users, including those in the test set [18]. More recently, De Montjoye et al. [7] showed that only four spatio-temporal points are enough to identify 95% of individual trajectories using mobile carriers’ antenna information, while the same group [17] showed that 90% of individuals can be re-identified using the trajectory of their credit card transactions.
Another salient source of private information is patterns of online behavior. Kosinski, Stillwell and Graepel showed [12] how private traits like gender, sexual orientation, ethnic origin and even the fact that a user’s parents divorced before her twenty-first birthday can be accurately inferred from apparently harmless, naturally revealed public data, such as Facebook likes. In another data domain, the pattern of an individual’s online or phone activity has been shown [21] to reveal valuable information about her habits and preferences. Furthermore, computer-based evaluations of human personality have been shown to be more accurate than those of close friends [27]. We show that private information can be extracted not only from structured anonymized datasets (such as [23]) or datasets rich in social information (such as [12, 21]), but even from data traces left for the public good, such as Wikipedia. Ramachandran and Chaintreau [20] recently studied how the structure of locally connected individuals affects privacy loss. Our focus is on the time dimension: we quantify the temporal evolution of privacy loss.
Editing behavior in Wikipedia. Wikipedia is the largest online collaborative encyclopedia. In its early years, Wikipedia showed rapid growth. Initial studies [3] explained the growth as driven by the rapidly increasing user base. The growth of the English Wikipedia slowed after 2007, with fewer new editors joining and fewer new articles created. A few studies [8, 22] explain this dynamic by the increasingly limited opportunities that Wikipedia editors have to make novel contributions: the easy articles have already been created, leaving only more difficult topics to write about. To make useful contributions to the site, editors must also meet an increasingly high bar of expertise in their field. In addition, Halfaker and McNeil [9] consider that Wikipedia’s mechanisms for managing quality and consistency deterred newcomers. In our profile of Wikipedia’s growth and decline (Sect. 5), we were surprised to see that the changes of activity over time are not uniform across content and personal demographic attributes. There is a rise in site maintenance, and some user groups showed a slower decline (PhD) or even a rise (self-identified muslim users) in editing activities. Finally, while social interactions in Wikipedia have been studied before [6, 26], to the best of our knowledge, this is the first study of private traits that can be inferred from editing activities.
3.
WIKIPEDIA ACTIVITIES AND USER TRAITS
Why Wikipedia? Wikipedia is an ideal data source for studying the longitudinal predictability of private traits, for three reasons. Firstly, it is an apparently harmless dataset, whose purpose is to be a reservoir of knowledge, with little or no focus on personal or social information. Unlike online social networks centered on users’ profiles, Wikipedia is not intended to record any individual contributor’s personal and social interactions. Secondly, Wikipedia’s entire edit history is publicly accessible. Wikipedia provides the longitudinal editing history of individuals, spanning over a decade (2001–2013 at the time of our snapshot). Such a uniquely long temporal extent allows the study of the effect of time on online privacy. Finally, Wikipedia contributors come from many geographic locations and numerous social, religious, educational and political backgrounds, providing a rich and diverse sample of activities and candidate personal traits. We show that as Wikipedia accumulates user data and editing activity, increasing amounts of private information can be inferred (see Sect. 6).
The Wikipedia dataset. Our dataset contains 13 years of edit history, from the beginning of Wikipedia in January 2001 to July 2013. In total, 188,805,088 revisions were performed by 117,523 editors on 22,172,813 pages. A revision is a Wikipedia term referring to an atomic edit of a page by a user, with an associated timestamp. All data used in this study are obtained from the July 2013 public Wikipedia dump; more details about data processing are in the supplemental material (SI) [1]. For predicting private traits (Sect. 4 and 6.1), we limit the studied period to between 01/2007 and 07/2013 in order to have sufficient numbers of users.
Table 1: Features to describe user editing patterns. (A) The basic feature set, quantifying the number of edits to all Wikipedia articles and to various community and user pages. (B) Additional features in the extended feature set, encoding edits to the thematic categories within Wikipedia CONTENT.

A. Basic feature set
Feature name | Wikipedia namespace codes | Feature description
CONTENT | 0, 6 | Revisions made to the body of the Wikipedia articles. Corresponds to the actual creation of information.
TALK-C | 1, 7 | Discussions on the talk pages of the Wikipedia articles. Corresponds to the overhead around the creation of encyclopedic information.
USER | 2 | Revisions made to user pages, including a user’s own page or another user’s page. Similar to profile edits and posts in online social networks.
TALK-U | 3 | Revisions on the talk pages corresponding to user pages. This is similar to social discussions, e.g., writing on a user’s wall in a social network.
WIKI | 4, 5 | Revisions on community pages, help desk, village pump, and related talk pages.
INFRA | 8, 9, 10, 11, 12, 13, 14, 15, 100, 101 | Revisions on pages that provide infrastructure for other tasks in Wikipedia: templates, categories and portals.

B. Extended feature set: the 23 thematic features added to the basic set
AGRICULTURE, APPLIED-SCIENCES, ARTS, BELIEF, BUSINESS, CHRONOLOGY, CULTURE, EDUCATION, ENVIRONMENT, GEOGRAPHY, HEALTH, HISTORY, HUMANITIES, LANGUAGE, LAW, LIFE, MATHEMATICS, NATURE, PEOPLE, POLITICS, SCIENCE, SOCIETY, TECHNOLOGY
User activity profiles. We encode a user’s activity using two sets of features. In the basic set, we count the number of revisions performed in a given period of time, over six predefined categories of edited pages. The intuition behind these features is to capture the intent of a user’s editing effort. For example, the CONTENT feature captures edits made to main Wikipedia articles and can be associated with the knowledge creation effort. Similarly, the WIKI and INFRA features quantify the effort put into organizing the editing effort (i.e., community pages, help desks), while USER and TALK-U capture the social components, such as constructing a personal page and talking to other users. Features in the basic set are based on the Wikipedia namespaces, and a summary of their respective meanings is given in Table 1. Wikipedia namespaces are organizational categories that encode the intended purpose of a page (details in SI [1]). The second set of features, i.e. the extended set, is constructed by adding to the basic set 23 new categories based on Wikipedia’s top-level category hierarchy. A Wikipedia page can be assigned by its editors to one or more of the 23 thematic categories, such as History, Geography, etc. The extended set provides a more detailed profile of a user’s activity, by capturing her editing interests. Details of feature construction are described in Sect. 4.
The editors’ personal information. Many Wikipedia users keep a user page (resembling a social network profile), on which they share information about themselves, their interests or the causes they support. Some share information typically considered private, such as their gender, ethnic origin, religion, education, or even sexual preferences. We retrieve these records using public APIs (see footnote 1), and use them as target personal traits. For the purpose of this study, we selected three private traits with sufficient user bases: gender (declared by 6936 users), education (undergrads, grads and PhD, declared by 9224 users) and religion (christian, muslim, atheist or jewish, declared by 7685 users).
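The basic feature set of Table 1 amounts to counting a user's revisions per group of namespace codes. The following is a minimal sketch of that counting, not the authors' pipeline: the function name `basic_features` and the input format (a plain list of namespace codes, one per revision, for one user in one time period) are assumptions made for illustration.

```python
from collections import Counter

# Namespace codes per basic feature, as listed in Table 1.
FEATURE_NAMESPACES = {
    "CONTENT": (0, 6),
    "TALK-C": (1, 7),
    "USER": (2,),
    "TALK-U": (3,),
    "WIKI": (4, 5),
    "INFRA": (8, 9, 10, 11, 12, 13, 14, 15, 100, 101),
}
BASIC_FEATURES = tuple(FEATURE_NAMESPACES)

# Reverse lookup: namespace code -> feature name.
NAMESPACE_TO_FEATURE = {
    code: feature
    for feature, codes in FEATURE_NAMESPACES.items()
    for code in codes
}


def basic_features(namespace_codes):
    """Count revisions per basic feature (hypothetical helper).

    namespace_codes: one namespace code per revision made by a user
    in the period of interest. Codes not listed in Table 1 are ignored.
    """
    counts = Counter(
        NAMESPACE_TO_FEATURE[code]
        for code in namespace_codes
        if code in NAMESPACE_TO_FEATURE
    )
    return tuple(counts[name] for name in BASIC_FEATURES)
```

For example, a user with revisions in namespaces 0, 0, 6, 2 and 10 would be described by three CONTENT revisions, one USER revision and one INFRA revision. Ignoring unlisted namespace codes is a choice of this sketch; the source does not say how such revisions are handled.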
4.
MEASURING PRIVACY LOSS
We study the loss of privacy by modeling it as a prediction problem: how well can we predict a given class variable Y (i.e., gender, education or religion) using descriptive features X in the basic or extended set? By following a set of users through time, we observe the dynamics of the predictive performance. We define the temporal loss of privacy as the improvement in explaining a variable linked to a private trait as we observe editing behaviors for longer periods of time. In this work we use two sets of tools. The first is a predictive approach: we predict the private traits of a hold-out set of users from increasingly longer activity histories on Wikipedia, and we observe the change in prediction accuracy. The second approach uses information transfer, a measure from physics and economics, to quantify the uncertainty in Y explained by feature X over time.
Footnote 1: https://www.mediawiki.org/wiki/API:Users
4.1
Encoding activities over time
We denote by $X_i^u$ a feature computed in timeframe $i$ for user $u$, with $X \in \{\text{CONTENT}, \text{TALK-C}, \text{USER}, \text{TALK-U}, \text{WIKI}, \text{INFRA}\}$ for the basic set, and encoded similarly for the extended set. We construct a series of temporal datasets, each extending the previous one by a 3-month period. A user appears in a temporal dataset if she/he has performed at least one revision during or before the last 3-month period. We construct two kinds of features over time: the instantaneous features $f_i^u$ for user $u$ in the $i$-th 3-month period alone, and the longitudinal features $F_i^u = [f_{1:i}^u]$ for user $u$, containing the series up to (and including) timeframe $i$. Features are constructed by counting the number of revisions performed by $u$ during the given timeframe, over the predefined categories, e.g., $f_i^u = (\text{CONTENT}_i^u, \text{TALK-C}_i^u, \text{USER}_i^u, \text{TALK-U}_i^u, \text{WIKI}_i^u, \text{INFRA}_i^u)$ for the basic set. Naturally, the features $F_i^u$ describe both the past and the current activities, and contain temporally increasing quantities of information.
In addition, for newly joined editors, we explicitly encode the missing values in previous timeframes. This is done by including a binary missing flag for each activity category: a value of 0 means that the editor has already joined Wikipedia (even if she is on a pause during timeframe $i$), and 1 means that the editor has not yet joined Wikipedia, i.e. the value is missing. For the basic set, this results in six additional binary features, i.e., $f_i^u = (\text{CONTENT}_i^u, \text{p\_C}_i^u, \text{TALK-C}_i^u, \text{p\_TC}_i^u, \text{USER}_i^u, \text{p\_U}_i^u, \text{TALK-U}_i^u, \text{p\_TU}_i^u, \text{WIKI}_i^u, \text{p\_W}_i^u, \text{INFRA}_i^u, \text{p\_I}_i^u)$, where features prefixed with p_ are the missing flags of the preceding feature. For the extended set, additional features for the 23 thematic categories are constructed in the same manner and appended to each $f_i^u$.
Figure 1: Venn diagram illustration of the Information Transfer measure between target variable $Y$ and feature $X$, over three successive time points $X_1$, $X_2$ and $X_3$. Even if $X_3$ explains a large portion of the information in $Y$, most of $Y$ was already explained by $X_1$ and $X_2$, and the new information that $X_3$ brings is given by the conditional mutual information $I(Y; X_3 | X_{1:2})$.
We tried other schemes to encode user activity, such as cumulative features that encode activities from the beginning until timeframe $i$, and found them to have lower performance. Therefore the rest of the paper presents only the incremental scheme.
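The incremental encoding of Sect. 4.1 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `instantaneous` and `longitudinal` are hypothetical, the input is assumed to be one dictionary of revision counts per 3-month timeframe, and a user is assumed to have joined at her first timeframe with at least one revision.

```python
BASIC_FEATURES = ("CONTENT", "TALK-C", "USER", "TALK-U", "WIKI", "INFRA")


def instantaneous(counts, joined):
    """Build f_i for one timeframe: each count followed by its missing flag.

    The flag is 0 once the user has joined (even if idle in this timeframe)
    and 1 if the user has not joined yet, i.e. the value is missing.
    """
    flag = 0 if joined else 1
    vector = []
    for name in BASIC_FEATURES:
        vector.extend([counts.get(name, 0), flag])
    return vector


def longitudinal(per_period_counts, upto):
    """Build F_upto = [f_1, ..., f_upto] for one user (hypothetical helper).

    per_period_counts: list of dicts (feature name -> revision count), one
    per 3-month timeframe; an empty dict means no revision in that timeframe.
    """
    # Assumption: the user joins at her first timeframe with any revision.
    first_active = next(
        (i for i, c in enumerate(per_period_counts) if sum(c.values()) > 0),
        None,
    )
    features = []
    for i in range(upto):
        joined = first_active is not None and i >= first_active
        features.extend(instantaneous(per_period_counts[i], joined))
    return features
```

With the basic set, each timeframe contributes 12 values (six counts and six flags), so the longitudinal vector after three timeframes has 36 entries. A user who is idle after joining gets zero counts with flags equal to 0, which distinguishes a pause from not having joined yet.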
4.2
Predicting personal attributes
We evaluate how well a private trait can be predicted by setting up a set of binary classification tasks. Multi-class target variables (i.e., religion and education) are transformed into binary prediction tasks in a one-vs.-all fashion (e.g., christian vs. non-christian). We use 2:1 stratified splits to construct the training and test user subsets: 66% of the editors are randomly selected to be part of the training set, and the remaining 33% are used for testing, with the random sampling preserving class priors. One model is learned for each class and each time period, using a logistic regression classifier with L1 regularization, which favors sparse feature weights, and with the hyperparameter obtained by cross-validation [11]. We also tried the L2 regularizer and observed lower performance. The performance is evaluated using the AUC metric (the area under the ROC curve) [11]. Intuitively, random-guess classifiers have an AUC of 0.5, and perfect classification has an AUC of 1.0. Accuracy and F-score for predicting each private trait are given in the SI [1]. We independently sample the train/test split 10 times, and record the mean and standard deviation of the AUC. While other classifiers can be used, we have not tried them in our experiments, since our interest lies in the evolution of prediction performance and not its absolute value.
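The evaluation protocol described above (one-vs.-all targets, stratified 2:1 splits, AUC) can be sketched independently of the classifier. The code below is an illustration only: the function names are hypothetical, the L1-regularized logistic regression itself is deliberately left out, and the pairwise AUC computation is a simple quadratic-time formulation chosen for readability rather than speed.

```python
import random


def one_vs_all(labels, positive):
    """Turn a multi-class trait into a binary target (1 = positive class)."""
    return [1 if label == positive else 0 for label in labels]


def stratified_split(y, train_fraction=2 / 3, seed=0):
    """Split user indices 2:1 while preserving class priors."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as one half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

To mimic the repeated sampling, one would call `stratified_split` with 10 different seeds, fit a classifier on each training subset, score the test users, and report the mean and standard deviation of the resulting `auc` values.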
4.3
Information transfer over time
Information theory measures are useful for capturing feature relevance in prediction tasks [13]. One particular measure, Information Transfer, was recently used [24] to uncover hidden links in social media. Intuitively, we capture the uncertainty of private information with the entropy of the target variable $Y$. The quantity of private information explained by feature $X$ is then given by the mutual information $I(Y; X)$. Since feature $X$ takes different values for each 3-month timeframe, it is useful to quantify the additional information contained in $X_t$ that was not already contained in the earlier features $X_{1:t-1}$. The Information Transfer measure, $I(Y; X_t | X_{1:t-1})$, is designed for this purpose. Fig. 1 illustrates the intuition behind $I(Y; X_t | X_{1:t-1})$ using a Venn diagram. In this example, $X_3$ contains quite a lot of information about $Y$, expressed as the mutual information $I(Y; X_3)$, or the large intersection between the red and blue circles. However, the new information that $X_3$ provides in addition to $X_{1:2}$ is much smaller. It is expressed as $I(Y; X_3 | X_{1:2})$, i.e., $I(Y; X_3)$ minus the part already covered by $I(Y; X_1)$ and $I(Y; X_2)$. Intuitively, information transfer is the Conditional Mutual Information of $Y$ and $X_t$ given $X_{1:t-1}$, or the amount of uncertainty that will be reduced after observing $X_t$:
$$I(Y; X_t | X_{1:t-1}) = H(Y | X_{1:t-1}) - H(Y | X_{1:t}).$$
The relationship above follows from the definition of mutual information and conditional entropy [13]. We implemented information transfer using the infotheo toolbox in R [15], which computes conditional entropies in high-dimensional input spaces by quantizing the input. We use Information Transfer $I(Y; X_t | X_{1:t-1})$ and the conditional entropies $H(Y | X_t)$ and $H(Y | X_{1:t})$ to answer two key questions: which features are most important in the disclosure of private information, and which time periods are critical to privacy loss.
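As an illustration of the definition in this section, information transfer can be estimated from samples with plug-in (empirical frequency) entropies. The paper relies on the infotheo R toolbox; the sketch below is not that implementation. It is a plain Python stand-in that assumes the feature values have already been quantized into discrete bins, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable values."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())


def conditional_entropy(y, x):
    """H(Y | X) = H(X, Y) - H(X), estimated from paired samples."""
    return entropy(list(zip(x, y))) - entropy(x)


def information_transfer(y, x_series, t):
    """I(Y; X_t | X_{1:t-1}) = H(Y | X_{1:t-1}) - H(Y | X_{1:t}).

    y: one class label per user.
    x_series: one list per user with the quantized feature value in each
              timeframe (index 0 holds X_1).
    t: 1-based timeframe whose additional information is measured.
    """
    past = [tuple(row[: t - 1]) for row in x_series]
    upto = [tuple(row[:t]) for row in x_series]
    return conditional_entropy(y, past) - conditional_entropy(y, upto)
```

For t = 1 the conditioning set is empty, so the measure reduces to the mutual information I(Y; X_1). Plug-in estimates of this kind are known to become unreliable when the number of distinct value combinations approaches the number of users, so the quantization granularity matters in practice.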
5.
EDITING BEHAVIOUR OVER TIME
In this section, we present a profile of wikipedia editing behaviour over time. While our profile concur with the slowdown of Wikipedia [8, 9, 22], our analysis detects the rise of maintenance effort and shows that different user groups contribute differently to various functional and topical sections of the encyclopedia. The decline of editorship and rise of maintenance. It has been observed [8, 9, 22] that Wikipedia’s growth slowed since 2007, with fewer new editors joining, and fewer new articles created. Our profiling shows the same phenomenon. Fig. 2 shows the number of active editors, new editors, and number of edits over time. A user is considered active in a time interval if she has submitted at least one revision in the given period. A user is considered a new users (or a newcomer) if she made her first revision in the given time interval. We can see that the slowdown started in 2007, with both the active population, and the total number of revisions decreasing steadily – as seen in the volume of CONTENT revisions in Fig. 2a, and revisions to the Nature section in Fig. 2b. We can also see that while the overall growth rate is slowing, the increasing amounts of accumulated information require an increasing effort to organize. Fig. 2c shows that the number of infrastructure-related revisions (INFRA) continues to increase, made by a decreasing number of users. To the best of our knowledge, this is the first work to detect and quantify this rise of maintenance. Different growth trends across editor demographics. One explanation [8] for the slowdown of Wikipedia is that the easy articles have already been created. This means that in order to make a novel, useful contribution to the site, editors must meet an increasingly high bar of expertise in the field. Fig. 3a plots the active population size for users with a declared education level. The three curves corresponding to undergrads, graduates and PhD have been scaled in [0, 1] to
35%
30%
0% TALK-C
0.6
0.4
0.3
0.2
0.1
Relative counts out of ALL edits
1
0 0
undergrads (3878)
PhD (784)
0.6
grads (4552)
0.2
0.0
Female Male
25%
20%
CONTENT
15%
USER TALK-U WIKI
0.5
(c)
Number of editors (in thousands)
education− Active editors over time 1.5
10
1.0
5 0.5
0 0.0
(a)
1.0
0.4
(a)
Gender - aggregated basic features
47.65 52.35 % %
41.02 58.88 % % NonCONTENT
10%
5%
INFRA
Education − aggregated descriptive features
0.0
Number of revisions (in millions)
15
(a)
undergrads grads PhD
0.5
− 10 200 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 − 7 04 200 − 7 10 200 − 8 04 200 − 8 10 200 − 9 04 200 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
5 20
2.0
muslim (895) jewish (827)
0.8
christian (3472)
0.6
0.2
atheist (2491)
0.0
Number of editors (in thousands)
2.5
− 07 200 − 2 01 200 − 2 07 200 − 3 01 200 − 3 07 200 − 4 01 200 − 4 07 200 − 5 01 200 − 5 07 200 − 6 01 200 − 6 07 200 −2 7 01 00 − 7 07 200 − 8 01 200 − 8 07 200 −2 9 01 00 − 9 07 201 − 0 01 201 − 0 07 201 − 1 01 201 − 1 07 201 −2 2 01 01 − 2 07 201 −2 3 01 3
10
Active editors New editors Total revisions
04
2
25
0.8
0.6
0.2
0.0
# active editors (scaled to [0, 1])
15
Number of revisions (in millions)
3
NATURE − Active/New editors and #revisions
01
0.4 04 − 10 200 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 − 7 04 200 − 7 10 200 − 8 04 200 − 8 10 200 − 9 04 200 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
Number of editors (in thousands) 20 30
50%
Gender - aggregated extended features
0%
Relative counts out of CONTENT edits A gr ic ul tu r N e at ur e La w H ea l H th ist or Sc y i La enc ng e ua ge M Bel at ie he f m G ati eo cs g H rap um hy an it So ies ci e Bu ty sin es s Li Po fe Ed litic uc s at io n Te Ar ch ts no Ch lo r o gy no lo gy En Cul vi tur ro e nm e A Pe nt pp o .S ple ci en ce s
0.8 4
− 07 200 − 2 01 200 − 2 07 200 − 3 01 200 − 3 07 200 − 4 01 200 − 4 07 200 − 5 01 200 − 5 07 200 − 6 01 200 − 6 07 200 − 7 01 200 − 7 07 200 − 8 01 200 − 8 07 200 − 9 01 200 − 9 07 201 − 0 01 201 − 0 07 201 − 1 01 201 − 1 07 201 − 2 01 201 − 2 07 201 −2 3 01 3
1.0
# active editors (scaled to [0, 1])
− 10 200 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 − 7 04 200 − 7 10 200 − 8 04 200 − 8 10 200 − 9 04 200 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
04 25
Active editors New editors Total revisions
01
# active editors (scaled to [0, 1])
CONTENT − Active/New editors and #revisions
CONTENT TALK_C USER TALK_U WIKI INFRA Agriculture Nature Law Health History Science Language Belief Mathematics Geography Humanities Society Business Life Politics Education Arts Technology Chronology Culture Environment People App.sciences
CONTENT TALK_C USER TALK_U WIKI INFRA Agriculture Nature Law Health History Science Language Belief Mathematics Geography Humanities Society Business Life Politics Education Arts Technology Chronology Culture Environment People App.sciences
Relative counts out of ALL edits
Relative counts out of all edits
− 07 200 − 2 01 200 − 2 07 200 − 3 01 200 − 3 07 200 − 4 01 200 − 4 07 200 − 5 01 200 − 5 07 200 − 6 01 200 − 6 07 200 −2 7 01 00 −2 7 07 00 − 8 01 200 − 8 07 200 − 9 01 200 − 9 07 201 − 0 01 201 − 0 07 201 − 1 01 201 − 1 07 201 − 2 01 201 − 2 07 201 −2 3 01 3
01
30
6
Active editors New editors Total revisions
4
0.2
2
0.1
0
0.0
(b)
religion− Active editors over time
0.4
(b)
0.4
Number of revisions (in millions)
INFRA − Active/New editors and #revisions
Religion − aggregated descriptive features
0.4
0.3
(c)
Figure 2: Wikipedia growth slowing down. (a) The decrease of the number of active editors, new editors and the total number of revisions for CONTENT (a) and thematic features (shown here NATURE, others in the SI [1]) (b). (c) The maintenance effort (INFRA revisions) needed to internally handle the bulk of Wikipedia is increasing.
1.0
gender− Active editors over time
male (4990)
female (1946)
(c)
Figure 3: The population of active editors over time, broken down by (a) gender, (b) education and (c) religion. Magnitudes for all classes are scaled from 0 to 1. Barplots show the relative effectives of classes, absolute effectives are given in parenthesis.
40%
Female Male
30%
20%
10%
(b)
atheist christian jewish muslim
0.3
0.2
0.1
0.0
(d)
Figure 4: Aggregated descriptive features show differences in editing patterns, when tabulated per gender (a) and (b), education (c) and religion (d). Features were computed as percentages out of the total revision count Particularly interesting features (i.e., features on which the separation is clearer or some patterns are inverted) are highlighted.
0.10 0.05
Percentage out of all revisions
0.15
0.04 0.03 0.02 0.01
− 10 200 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 − 7 04 200 − 7 10 200 − 8 04 200 − 8 10 200 − 9 04 200 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
10 00 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 −2 7 04 00 −2 7 10 00 − 8 04 200 − 8 10 200 −2 9 04 00 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
−2 04
(a)
0.20
atheist christian jewish muslim
0.15
0.10
0.05
0.00
0.00
0.00
0.25
(b)
− 10 200 − 2 04 200 − 2 10 200 − 3 04 200 − 3 10 200 − 4 04 200 − 4 10 200 − 5 04 200 − 5 10 200 − 6 04 200 − 6 10 200 − 7 04 200 − 7 10 200 − 8 04 200 − 8 10 200 − 9 04 200 − 9 10 201 − 0 04 201 − 0 10 201 − 1 04 201 − 1 10 201 − 2 04 201 −2 2 01 3
0.20
f m
04
Percentage out of all revisions
0.25
0.05
04
Percentage out of all revisions
grads PhD undergrads
religion − CONTENT percentage of ALL
gender − USER percentage of ALL
education − CONTENT percentage of ALL 0.30
(c)
Figure 5: Temporal evolution of mean values of features, broken down by classes in each private trait. We present three selected examples of pairs (feature, trait), other graphics are in the SI [1]: (a) (CONTENT, education), (b) (USER, gender ) and (c) (CONTENT, religion). All values are computed as the number of revisions on the given category during a timeframe and expressed as percentages of the total number of revisions. The mean value over all users is presented. render them comparable. All three population show the expected initial rapid increase. The undergrads and graduates reach maximum at the same time as the general population The more specialized PhD population seems to peak much later, in early 2010, which seems to confirm the hypothesis that the required increase in the specialization of editors is one of the factors responsible for the slowdown of Wikipedia. We can also see differing demographic trends across different groups in editors’ religion and gender. In Fig. 3b, the number of christian, jewish and atheist editors start to decrease around 2007. On the other hand, the self-declared muslim population seems to continuously increase, at a slower pace than during Wikipedia’s initial growth 2001-2007. Fig. 3c plots the number of active editors by gender, we can see that the number of active female editors started decreasing earlier than that of male editors. While gender imbalance in Wikipedia has been previously studied [9], no other discussion of the evolution across time of gender, religion and education is present in prior literature. Aggregated edit counts correlate with private traits. We describe a user’s editing activity by aggregating her revision counts over a number of predefined categories. We conduct a exploratory analysis by presenting the averages of the features, with consideration for each of the private traits. Fig. 
4a shows differences between the average male and female behavior: 59% of all the revisions performed by males are CONTENT, compared to 48% for females. Females tend to relate more socially, by writing more on USER pages (35% of all revisions for females, less than 25% for males). Similarly, Fig. 4b presents average male (highlighted in blue) and female (highlighted in red) behavior, over features in the extended set. Females edit more subjects like Agriculture, Health, History, Language, Belief, Arts and People, while males edit more Mathematics, Society, Business, Geography and Culture. While the absolute differences between average behavior by gender tend to be rather small, they indicate a separability of the two classes. We perform a similar analysis for education (Fig. 4c) and religion (Fig. 4d): undergrads create fewer CONTENT revisions and more USER revisions, while graduates and PhD populations both dedicate a higher percentage of revisions to CONTENT. The PhD editors are more active on technical categories, such as Science, Mathematics, Education, Technology, Health and Applied Sciences, and the graduates edit more subjects like People, Environment,
Culture, Society and Life. When aggregating per religion, Fig. 4d shows that the jewish editors are the most prolific in all thematic sections (the extended feature set), except Mathematics. muslim editors dedicate higher attention to Belief, Language, Law and INFRA, and lower attention to Arts, Education and Society. Atheist users dedicate more time to editing Mathematics, Science, Nature and Culture, and less time to People and Law. This static analysis of mean behavior suggests that there are regularities in the editing patterns for each population. These could be exploited for training a classifier and predicting whether a new user belongs to any of these classes.
Evolution of editing patterns. We further study how editing patterns evolve over time. For each feature, we compute the mean number of revisions over each timeframe, broken down by class. This value is still an aggregate measure over an entire subpopulation, but it evolves temporally, therefore hinting at changes in editing patterns. Fig. 5 shows examples of temporal evolution for three selected pairs (feature, private trait). More examples are presented in the SI [1]. Fig. 5a shows the feature CONTENT differentiated over levels of education. The static pattern shown in Fig. 4c is a result of a change in editing patterns over time: editors with a PhD edit CONTENT more as Wikipedia matures, as also shown in the editor population breakdown in Fig. 3a. For other features, the editing pattern also evolves over time. Contrasting Fig. 5b and Fig. 4a, we can see the differentiation of edits to the USER section by gender: in the overall aggregate, female users edit USER pages more than male users, but the temporal view shows that this holds only until 2009. TALK-C and BELIEF present the same temporal pattern shift, as shown in the SI [1]. Finally, some features present unexpected trends. Fig. 5c reveals that, unlike the general trend of a decreasing number of contributions, CONTENT-related edits increase for muslim editors.
This temporal analysis reinforces the hypothesis that user editing patterns, as well as their evolution, are differentiated along different user traits.
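The per-class means plotted in Fig. 5 can be illustrated with a minimal sketch, assuming the computation stated in the figure caption: each user's revisions on a feature in a timeframe are expressed as a percentage of her total revisions (ALL), and these percentages are averaged over the users of each class. The record layout (`timeframe`, `label`, `ALL` keys) and the choice to skip users with no revisions in a timeframe are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def mean_share_by_class(rows, feature):
    """Mean percentage of revisions on `feature`, per (timeframe, class).

    `rows` is an iterable of dicts with the hypothetical keys
    'timeframe', 'label', 'ALL' and one key per feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row["ALL"] == 0:
            # Assumption: users without revisions in a timeframe are skipped.
            continue
        key = (row["timeframe"], row["label"])
        sums[key] += 100.0 * row[feature] / row["ALL"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy records (invented values, for illustration only).
toy_rows = [
    {"timeframe": "2007Q1", "label": "female", "USER": 3, "ALL": 10},
    {"timeframe": "2007Q1", "label": "female", "USER": 1, "ALL": 10},
    {"timeframe": "2007Q1", "label": "male", "USER": 1, "ALL": 4},
    {"timeframe": "2007Q1", "label": "male", "USER": 0, "ALL": 0},
]
toy_means = mean_share_by_class(toy_rows, "USER")
```

Averaging per-user percentages weights every editor equally, regardless of how many revisions she made; pooling all revisions of a class before dividing would instead let the most prolific editors dominate the curve.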
6.
PREDICTION RESULTS
We predict personal traits of editors using the behavioral features described in Sec. 4.1. We report the prediction performance over time using different features. We also perform feature relevance analysis using the information transfer metric to pinpoint the source of performance gain over time.
[Figure 6 panels: (a) Prediction of gender (basic features); (b) Prediction of education - undergrads (basic features); (c) Prediction of religion - Muslim (basic features). Y-axis: Area under ROC. Linear fits printed in the panels: f(x) = 0.002x + 0.628, R² = 0.838, p-val = 3.219 × 10^-14; f(x) = 0.001x + 0.639, R² = 0.945, p-val = 5.882 × 10^-13; f(x) = 0.003x + 0.579, R² = 0.871, p-val = 4.055 × 10^-12.]
Figure 6: Temporal evolution of privacy loss, measured using the mean AUC value over 20 executions (error bars denote standard deviation). Results of inferring, using binary predictors on the basic feature set, gender (a), education/undergrads (b) and religion/muslim (c). The results for all the other binary predictors are given in the SI [1].
[Figure 8 panel title: gender - New Entry vs. Fixed Population (basic features)]
6.1
Predicting personal attributes over time
Predictability of private traits improves over time. We train binary predictors for every class of every private trait in our study, on datasets with increasing amounts of history (as shown in Sec. 4.2). The editors’ activity is described using the basic feature set. Fig. 6 shows the AUC over time for three selected examples of private traits (all remaining classes are in the SI [1]). The graphics show the performance of predicting the gender of editors (Fig. 6a), whether they are undergrads (Fig. 6b) or of muslim religion (Fig. 6c). The AUC measure increases over time, roughly following a linear trend (coefficient of determination R² > 0.83 for all three examples). The AUC differences between the first and the last timeframes are statistically highly significant (t test p < 0.001, details and results in the SI [1]). We interpret this steady increase of performance over time as loss of privacy: as more historical information becomes available, the learning algorithm infers more accurately user traits which are potentially private. We also report the model performance on an intuitive measure called equal Precision and Recall (ePR), defined as the point where a 45-degree line from the origin intersects the precision-recall curve. For the three classes in Fig. 6,
in the last timeframe, we obtain a mean ePR of 0.791 (for gender), 0.535 (for education/undergrads) and 0.9 (for religion/muslim). All the other classes and the evolution of ePR over time are found in the SI [1]. The ePR measure quantifies how accurate the predictions of certain private traits are. We find that certain religious attributes have notably high prediction performance from behavioral features (muslim ePR = 0.9 and jewish ePR = 0.913).
In a similar setup, we predict user gender using the extended feature set and we plot the AUC over time in Fig. 7. Alongside, we show the results for the basic feature set and the relative gain between the two, for each timeframe. Predictions using the extended features consistently outperform those using the basic features, showing that knowledge about thematic editing patterns is informative about user private traits. The AUC time series for the two feature sets are highly correlated (Pearson correlation of 0.973). This shows that, while adding the thematic information improves the absolute value of the prediction accuracy, it seems to have little influence on the evolution of the privacy loss. We speculate that the evolution of privacy loss is not linked to the way the data is described, but is rather intrinsic to the online social environment. To the best of our knowledge, this is the first study to highlight and quantify this intrinsic cumulative effect of time over privacy.
[Figure 7 plot: y-axes Area under ROC and Relative gain; x-axis quarterly timeframes from 01−2007 to 10−2013; annotation: Pearson correlation coefficient: 0.973.]
Figure 7: Comparison of the temporal evolution of the privacy loss on gender for the basic and extended feature sets. The extended feature set consistently provides better performance, while the AUC series of the predictors trained on the two feature sets are highly correlated and present the same trends.
[Figure 8 plot: series New entry, Fixed population and Relative gain; y-axes Area under ROC and Relative gain; x-axis quarterly timeframes from 01−2007 to 10−2013.]
Figure 8: Evolution of privacy loss for the population fixed to its composition in the first quarter of 2007 (i.e., no newcomers) and a population in which new users can enter. We quantify the privacy loss due to newcomers as the relative gain of learning performance between the two populations.
Sources of privacy loss: the online breadcrumbs and newcomers. We hypothesize that the temporal privacy loss is caused by a joint effect of two factors: i) the information learned from new users who enter the population and ii) online breadcrumbs, which yield a more accurate estimate of how the behavioral features correspond to personal traits. To separate these factors, we study two scenarios defined by subsets of the editor population: “New Entry” (NE), in which new users can enter freely throughout time, and “Fixed population” (FP), which is limited to users active in the first timeframe (i.e., the first quarter of 2007). In Fig. 8, we plot the AUC over time when studied on FP and we compare it to NE. The curve corresponding to FP increases over time, though more slowly in later timeframes. The difference between the first and last timeframe is statistically very significant (t test p < 0.01). Intuitively, only the online breadcrumbs could cause this privacy decay for FP. By comparison, predictions on NE are consistently more accurate and continue to improve notably beyond the initial burst detected for FP. We attribute this additional improvement to information relating to new users entering the system.
[Figure 9 panel titles: (a) Conditional entropy of gender, on each descriptive feature; (d) Instantaneous Mutual Information evolution; (b) Fixed population - Information Transfer evolution; (e) New Entry - Information Transfer evolution; (c) Fixed population - Conditional Entropy evolution; (f) New Entry - Conditional Entropy evolution. Y-axes: Remaining unexplained Entropy, Instantaneous Mutual Information, Information Transfer. Series in (b) to (f): CONTENT, LIFE, SOCIETY. X-axis of (b) to (f): quarterly timeframes from 01−2007 to 10−2013. Feature order on the x-axis of (a): CONTENT, LIFE, SOCIETY, NATURE, CULTURE, SCIENCE, APPLIED.SCIENCES, HUMANITIES, BUSINESS, POLITICS, TECHNOLOGY, HISTORY, GEOGRAPHY, ARTS, LANGUAGE, EDUCATION, CHRONOLOGY, HEALTH, PEOPLE, USER, BELIEF, LAW, ENVIRONMENT, MATHEMATICS, TALK.C, TALK.U, AGRICULTURE, WIKI, INFRA.]
Figure 9: (a) Conditional entropy after conditioning on each feature (on the New Entry population, NE). Features are ordered by the conditional entropy $H(Y \mid X)$, where Y = 0/1 is the gender attribute and X is each feature. Lower is better. Basic features are shown in red, extended features in blue. (d) Mutual Information between each feature at time t and gender (on NE). Information Transfer on Fixed Population FP (b) and NE (e). Conditional entropy evolution on FP (c) and NE (f). We can see that while later edits contain just as much information about a user’s privacy as the earlier edits, they contribute less to prediction gain, since most of the information they bring was already known.
Are later edits less harmful to one’s privacy? Intuitively, the longer a user edits, the more she discloses about herself.
We approach this issue using the Temporal Information Theory measure detailed in Sec. 4, on both FP and NE populations. Let T be the number of timeframes. Fig. 9a shows the conditional entropy of the class variable Y (here gender), after conditioning on all the temporal instantiations of a given feature X (i.e., $H(Y \mid X_{1:T})$), on the NE dataset. We can see which features give out the most information about a user’s gender. Consistent with the data profile in Sec. 5, CONTENT differentiates males from females the most.
The thematic features, such as Life, Society, Nature or Culture, follow with very similar scores. Surprisingly, USER distinguishes gender less than the profiling analysis suggested. An almost identical ordering of feature importance is obtained on FP. Fig. 9d presents the instantaneous mutual information over time between the gender variable and the three most important features. All three present very similar dynamics (both on NE and on FP). The information overlap between the temporal instantiations of features and the private trait remains almost constant until close to the end of the studied period. This answers the question of whether later edits are less harmful: later activity hurts privacy as much as the initial activity, as it discloses similar quantities of information. We further study the amount of new information introduced by features in later timeframes. Fig. 9b and 9e show the Information Transfer over time on FP and NE, respectively, for the same three features. All series present an initial burst, after which they drop quickly. Note that, because the FP and NE populations differ in size, the absolute values are not directly comparable and only their evolutions are meaningful. We can see that later features $X_t$ bring little new information not already disclosed by the earlier features $X_{t-1}, X_{t-2}, \ldots$. The take-home message for this subsection is: while later edits contain just as much information about a user’s privacy as the earlier edits, they tend to be less harmful since most of the information they bring has already been learned.
The continuous impact of newcomers on privacy loss. Information Theory measures provide means for separately quantifying the privacy loss due to “online breadcrumbs” and newcomers. For FP (Fig. 9b), the utility of later edits drops to virtually zero, whereas for NE (Fig. 9e) it decreases to a small but non-negligible value. The information inferred from newcomers seems to be moderate, but constant in time. This information is also responsible for the continuous increase of prediction performance detected in Fig. 8 for the NE population. Similar conclusions can be drawn by studying the conditional entropy over time for the two populations. For FP (Fig. 9c), it decreases rapidly and remains constant afterwards, showing that virtually no new information is learned after the initial burst. For NE (Fig. 9f), it continues to decay even after the initial burst, though at a slower pace.
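The distinction drawn above, between how much information a feature carries and how much of it is new, can be illustrated on discrete toy data. The sketch below is not the authors' implementation: the Information Transfer measure is defined in Sec. 4 and is not reproduced here, so the `new_information` helper substitutes a conditional mutual information $I(Y; X_t \mid X_{t-1})$ as a stand-in, and the features are assumed to be already discretized. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y | X) = H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(list(x))


def mutual_information(y, x):
    """I(X; Y) = H(Y) - H(Y | X)."""
    return entropy(list(y)) - conditional_entropy(y, x)


def new_information(y, x_t, x_past):
    """Stand-in for 'new' information: H(Y | X_past) - H(Y | X_past, X_t)."""
    joint = list(zip(x_past, x_t))
    return conditional_entropy(y, x_past) - conditional_entropy(y, joint)


# Toy example: the later feature repeats what the earlier one already revealed.
gender = [0, 0, 1, 1]
x_early = [0, 0, 1, 1]
x_late = [0, 0, 1, 1]
```

In this toy case `mutual_information(gender, x_late)` equals `mutual_information(gender, x_early)` (one bit each), while `new_information(gender, x_late, x_early)` is zero: the later feature is as informative as the earlier one, but nothing in it is new.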
6.2
Predicting the attributes of exited users
Privacy continues to erode even for retired users. We further analyze what happens to the privacy of users who left the system. After their retirement, no more user-originating information (“online breadcrumbs”) is available to disclose private traits. We quantify the prediction performance on a user population who edited prior to 01.01.2008, but stopped after this date. Therefore, any information introduced in the system by the users themselves is restricted to the timeframes before 2008. In Fig. 10a, we plot the AUC over time for the education/undergrads binary classifier. We observe a constant increase of prediction performance, even though it saturates in later timeframes. An increase of prediction performance for retired users is also observable for religion/christian (see the SI [1]), but not for any of the other binary classifiers. Since these users are no longer active, the loss of privacy after 01.2008 is not the result of their own actions. A Temporal Information Theory analysis performed on this exited population shows Information Transfer values of zero after 01.2008, i.e., no information originating from “online breadcrumbs”. As far as we know, these are the first results to show that certain private traits can be predicted increasingly better even after the users have exited the system.
Why does privacy degrade for exited users? Intuitively, user-originating information is available only until the exit of the users; afterwards, the only source of new information is the actions of other users. Predictions about unseen retired users can be made only using features relating to their period of activity (here 01/2007 - 12/2007). In the logistic regression models learned at each timeframe, the strength of the links between features and the class variable is given by the corresponding coefficients. We study the coefficients of features which encode the activity of users prior to their retirement (i.e., $X^u_{1:4}$).
We show the coefficients relating to CONTENT (in Fig. 10b) and to USER (in Fig. 10c), in models learned for education/undergrads in each timeframe. In later timeframes, CONTENT features increase in importance, with both CONTENT2 (number of CONTENT revisions in the 2nd quarter of 2007) and CONTENT3 steadily increasing after being completely absent from the initial models. CONTENT4 remains absent for all timeframes. Simultaneously, USER features decrease in importance, with USER4 disappearing completely. We hypothesize that the AUC increase observed after 01.2008 originates with the currently active users whose activity overlapped with that of the exited users: the classifier learns from users active both before and after 01.2008, by modifying the weights of features, including those before 01.2008. In this case, it learns that CONTENT features should have more importance, while USER features should have less. In summary, the prediction performance for personal traits increases even for retired users, and the key factor for this improvement is a better estimate of the weights of a subset of important features such as CONTENT.
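The two user groups involved in this analysis can be separated with a few lines of code. The following is a minimal sketch under stated assumptions: the input maps each user to the dates of her revisions, "exited" users have at least one revision before the cutoff and none on or after it, and "overlapping" users have revisions on both sides. The function name, the input layout and the treatment of the cutoff day itself are illustrative choices, not taken from the paper.

```python
from datetime import date


def partition_users(edit_dates, cutoff):
    """Split users into (exited, overlapping) sets relative to `cutoff`.

    `edit_dates` maps a user id to an iterable of `datetime.date` objects.
    Assumption: an edit made exactly on the cutoff day counts as 'after'.
    """
    exited, overlapping = set(), set()
    for user, dates in edit_dates.items():
        dates = list(dates)
        before = any(d < cutoff for d in dates)
        after = any(d >= cutoff for d in dates)
        if before and not after:
            exited.add(user)       # evaluated on, never seen again after cutoff
        elif before and after:
            overlapping.add(user)  # active on both sides of the cutoff
    return exited, overlapping


# Toy data (invented user ids and dates).
toy_edits = {
    "u1": [date(2007, 3, 1), date(2007, 11, 20)],
    "u2": [date(2007, 5, 5), date(2009, 2, 14)],
    "u3": [date(2010, 1, 1)],
}
toy_exited, toy_overlapping = partition_users(toy_edits, date(2008, 1, 1))
```

Users who only started editing after the cutoff (such as `u3` above) fall in neither set; in the scenario described above they would still contribute to training, but share no pre-2008 features with the exited population.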
7.
DISCUSSION
We present a first study to quantify the extent of gradual privacy erosion over six years. First, we set up a large-scale evaluation using Wikipedia editing behavior to predict private traits over time. We analyze a 13-year history of Wikipedia edits made by more than 117 thousand users. Our descriptive analysis shows that, as Wikipedia evolves, editors with different personal traits show distinct patterns in their volume of edits and topical preferences. Second, we provide experimental evidence that time has an adverse effect on privacy. We show that prediction performance on private traits, such as gender, education and religion, increases with longer Wikipedia editing history. Third, we show that prediction of private traits improves even for users who have stopped editing Wikipedia. We further quantify the sources of this predictability, and find that the improved performance can be attributed to two factors: new editors of Wikipedia, and better estimates of feature relevance, with the first having a larger effect. To the best of our knowledge, this is the first study to quantify the change in the predictability of private traits over time, using Wikipedia, an open online dataset containing behavior breadcrumbs. This work should raise public awareness of the privacy implications of online activity over long periods of time. The fact that the information of newcomers can be used to learn more about existing members has profound implications: users do not have complete control over the consequences of the information they release. Reflecting on this work, we would like to discuss a few of its limitations, practical implications and connections to other areas of research.
What does it really mean for Wikipedia users, should they be worried? To the best of our knowledge, no studies have shown that the real identity of Wikipedia users can be revealed based on their editing activity. However, this may not be the case for other social networks.
User disclosure bias.
This work uses self-disclosed personal traits on users’ public profiles. It is well known that such a data source is prone to users’ disclosure bias. The set of users who voluntarily disclose private information might be biased towards users less concerned with their privacy, who in turn have a distinct behavioral pattern. Validating the effect of such a bias would require an alternative source of ground truth and maybe even behavioral data, and is beyond the scope of this study.
What about other online platforms? Although we predict private traits from Wikipedia, similar prediction results (on a static snapshot) were reported for other platforms such as Facebook [12]. Being a collaborative encyclopedia, Wikipedia records a relatively small amount of information about its users. More detailed longitudinal analyses could be performed on platforms such as Facebook, where an increasing trend in the predictability of personal traits is also likely.
A natural law of evolution of privacy loss. This study provides an empirical analysis of privacy loss. A longer-term open challenge is a physical model for privacy loss, i.e., a model that predicts the de-anonymization rate of a given anonymized dataset.
[Figure 10 panels: (a) education - undergrads - users inactive after 01/2008, y-axis: Area under ROC, with the lastActive timeframe and the period of inactivity marked; (b) Learned coeff. of CONTENT feature in prediction models (series CONTENT2, CONTENT3), y-axes: Feature coefficient in model and Prediction performance (auc); (c) Learned coeff. of USER feature in prediction models (series USER2, USER3, USER4), y-axes: Feature coefficient in model and Prediction performance (auc). X-axis: quarterly timeframes from 01−2008 to 10−2013.]
Figure 10: (a) Increase of prediction performance for users retired after 01.2008, for education/undergrads (others in the SI [1]). Coefficients of one model per timeframe: (b) CONTENT coefficients (absent in the model corresponding to the dataset where users were last active) increase in importance. (c) USER-related features decrease in importance.
Effective conditions for preserving privacy? There is a growing literature on characterizing which privacy guarantees can be obtained under a given information release protocol [25]. We hope our findings invite new empirical and theoretical investigation into the case in which data release is spatio-temporal and heterogeneous across different entities. In light of this study, we advocate that new means should be found to tackle the issue of online privacy. We argue that one feasible means of preserving privacy is to construct laws which would enable erasing the recorded activity in the online environment.
Acknowledgments NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. This material is based on research sponsored by the Air Force Research Laboratory, under agreement number FA2386-15-1-4018.
8.
REFERENCES
[1] Supplementary material: Evolution of privacy loss in Wikipedia, 2015. http://goo.gl/JT6WK7.
[2] A. Acquisti, L. K. John, and G. Loewenstein. What is privacy worth? The Journal of Legal Studies, 42(2):249–274, June 2013.
[3] R. Almeida, B. Mozafari, and J. Cho. On the evolution of Wikipedia. In ICWSM ’07, 2007.
[4] D. Barth-Jones, K. E. Emam, J. Bambauer, A. Cavoukian, and B. Malin. Assessing data intrusion threats. Science, 348(6231):194–195, Apr. 2015.
[5] d. boyd and A. E. Marwick. Social Privacy in Networked Publics: Teens’ Attitudes, Practices, and Strategies. A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, Sept. 2011.
[6] C. Danescu-Niculescu-Mizil, L. Lee, B. Pang, and J. Kleinberg. Echoes of power: Language effects and power differences in social interaction. In WWW, page 699, 2012.
[7] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3, 2013.
[8] A. Gibbons, D. Vetrano, and S. Biancani. Wikipedia: Nowhere to grow. Tech. report, Stanford, 2012.
[9] A. Halfaker, R. S. Geiger, J. T. Morgan, and J. Riedl. The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline. American Behavioral Scientist, 57(5):664–688, Dec. 2012.
[10] C. J. Hoofnagle, J. King, S. Li, and J. Turow. How Different are Young Adults from Older Adults When it Comes to Information Privacy Attitudes and Policies? SSRN scholarly paper, Apr. 2010.
[11] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning, volume 103 of Springer Texts in Statistics. Springer New York, 2013.
[12] M. Kosinski, D. Stillwell, and T. Graepel. Private traits and attributes are predictable from digital records of human behavior. PNAS, 110(15):5802–5805, 2013.
[13] D. J. MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
[14] A.-M. Meyer and D. Gotz. A new privacy debate. Science, 348(6231):194–194, Apr. 2015.
[15] P. E. Meyer. R package ’infotheo’, 2014.
[16] Y.-A. de Montjoye and A. S. Pentland. Assessing data intrusion threats–Response. Science, 348(6231):195–195, Apr. 2015.
[17] Y.-A. de Montjoye, L. Radaelli, V. K. Singh, and A. S. Pentland. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, 347(6221):536–539, Jan. 2015.
[18] A. Narayanan, E. Shi, and B. I. P. Rubinstein. Link prediction by de-anonymization: How we won the Kaggle social network challenge. In IJCNN, pages 1825–1834, 2011.
[19] A. Narayanan and V. Shmatikov. Myths and fallacies of personally identifiable information. Comm. of the ACM, 53(6):24–26, June 2010.
[20] A. Ramachandran and A. Chaintreau. The Network Effect of Privacy Choices. In Workshop EcoNet, pages 1–4, 2015.
[21] J. Saramaki, E. A. Leicht, E. Lopez, S. G. B. Roberts, F. Reed-Tsochas, and R. I. M. Dunbar. The persistence of social signatures in human communication. PNAS, 2014.
[22] B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The Singularity is Not Near: Slowing Growth of Wikipedia. In WikiSym ’09, pages 8:1–8:10. ACM, Oct. 2009.
[23] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
[24] G. Ver Steeg and A. Galstyan. Information transfer in social media. In WWW ’12, page 509, 2012.
[25] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Jour. of American Stat. Assoc., 105(489):375–389, 2010.
[26] H. T. Welser, D. Cosley, G. Kossinets, A. Lin, F. Dokshin, G. Gay, and M. Smith. Finding social roles in Wikipedia. In iConference, pages 122–129. ACM, 2011.
[27] W. Youyou, M. Kosinski, and D. Stillwell. Computer-based personality judgments are more accurate than those made by humans. PNAS, 2014.
Publication venue: WSDM ’16
Research Article
Supplementary Material: Evolution of privacy loss in Wikipedia Main article DOI: 10.1145/2835776.2835798
Marian-Andrei Rizoiu (1,*), Lexing Xie (2), Tiberio Caetano (3) and Manuel Cebrian (4)
1 NICTA & Australian National University, Canberra, Australia
2 Australian National University & NICTA, Canberra, Australia
3 Ambiata, ANU and UNSW, Sydney, Australia
4 NICTA & University of Melbourne, Melbourne, Australia
Correspondence*: Marian-Andrei Rizoiu, NICTA Research Lab, 7 London Circuit, Canberra, Australia,
[email protected]
We provide in this document detailed information about:
• the construction of the Wikipedia dataset used in the Main Text. Additionally, we provide links for downloading the user dataset (useful for reproducing the prediction results) as well as the complete edit dataset;
• additional data profiling figures, completing the narrative in the Main Text;
• additional prediction performance measures.
CONTENTS
1 The construction of the Wikipedia dataset
1.1 The Wikipedia categorization system and the extended features
1.2 Constructing the editor’s private data: the private traits
1.3 Downloading the Wikipedia dataset
2 Additional data profiling graphs
3 Statistical testing of increasing Privacy Loss
4 Computing other performance measures: FScore and “equal Precision and Recall”
5 Additional prediction performance graphs
1 THE CONSTRUCTION OF THE WIKIPEDIA DATASET
Wikipedia internally uses a revision system, which records every edit or modification made to a page by a user. Such atomic operations are called “revisions” in Wikipedia’s vocabulary. The version of a page at any given moment in time can be obtained by superimposing all the revisions made to the page from its creation until the given moment. The entire history of revisions is publicly available for download (Wikipedia dumps: https://dumps.wikimedia.org/enwiki/latest/). The user descriptive features introduced in the main article were constructed starting from the July 2013 English Wikipedia stub dump. This dump contains the entire history of Wikipedia,
Rizoiu et al.
Supplementary Material
starting from its creation in January 2001 until July 2013. We record for each individual revision its editor, target page and timestamp. For purposes of internal organization, Wikipedia pages are assigned to namespaces, which are categories based on the intended purposes of pages: main articles, article talk pages, user pages, user talk pages, community pages, project pages, etc. The basic descriptive features for users are constructed based on the Wikipedia namespaces of edited pages, as shown in the main article.
1.1 THE WIKIPEDIA CATEGORIZATION SYSTEM AND THE EXTENDED FEATURES.
We construct the extended descriptive features based on the Wikipedia categories (see footnote 2), a categorization system based on the theme of the articles. This categorization system defines main categories, such as Geography, History, Arts, which can be further divided into finer subcategories. This forms a shallow hierarchy, i.e., a hierarchy with a low number of levels. Loops are allowed inside the hierarchy (a given category can be a subcategory of multiple parent categories), though discouraged, and the resulting category graph does not possess a strict tree structure. We select the 23 main categories of Wikipedia to serve as thematic features in a user’s editing description. Each Wikipedia article can be placed under one or multiple (sub-)categories. Every time a user edits a page, we propagate the resulting revision through the category graph by following the category parent relation. The user’s revision counts for all reachable main topics are incremented. We avoid infinite propagation through the loops by using a propagation threshold, equal to the average “height” of the hierarchy. Although the graph is not a proper hierarchy, we define the “up” propagation direction as the direction towards the main categories.
1.2 CONSTRUCTING THE EDITOR’S PRIVATE DATA: THE PRIVATE TRAITS.
User pages are similar to regular Wikipedia pages, with the difference that they are dedicated to users. Each user has, by default, a user page in Wikipedia, which serves the role of a “social profile” as in an online social network. The associated talk pages are employed for private discussions. The user pages and the user talk pages form Wikipedia’s social aspect, which has been shown to be a prolific environment for activities such as campaigning for internal elections (Danescu-Niculescu-Mizil et al., 2012). All the information that we use as private information is provided by some of the editors themselves, on their personal user pages. By adding labels to their user pages, editors give information about their geographic location, nationality, religion, profession, education level, philosophy or even sexual preferences. Similar to the aforementioned page categories, user categories are reconstructed bottom-up, based on the label information scraped from the user pages using the Wikipedia API. As an example of the type of information that we can obtain about editors, we show the visualization of the first-level child nodes of the categories providing information about the editor’s location (Figure 2a), profession (Figure 2b) and religion (Figure 3). Some of the categories are further divided into subcategories, which provide increasingly fine-grained information. The three categories selected as private traits in the main article (i.e., gender, religion and education) were chosen for being closer to what humans perceive as private information, as opposed to spoken languages or geographic location. We selected only a subset of all available subcategories (e.g., christian, muslim, atheist and jewish for religion) in order to have a reasonable number of editors on which to perform the analysis. We restrict our dataset to registered users for whom at least one piece of private information is retrieved from their user page.
Summarizing, the dataset includes for each user:
• a description of the user’s editing activity, recorded down to revisions on individual pages, together with their timestamp;
• some private information, retrieved from their own user pages, under the form of user categories.
2 Wikipedia categories descriptions: https://en.wikipedia.org/wiki/Portal:Contents/Categories
The dataset contains:
• 188,805,088 revisions
• 117,523 users
• 8,679 user categories (and their hierarchical relations)
• 22,172,813 edited pages
• 430,410 page categories (and their hierarchical relations)
• extent of time: beginning of Wikipedia (January 2001) until July 2013.
1.3 DOWNLOADING THE WIKIPEDIA DATASET
We make the Wikipedia dataset used in this work available for public download. We provide two versions of the dataset: the temporal user editing behavior dataset and the complete editing dataset.
Temporal user editing behavior dataset
Download (∼82 MB): http://goo.gl/Tx5SoI
Description: Dataset containing user activity over time and the three private traits used in this work: gender, education and religion.
Usage: This dataset is provided so that any third party can reproduce the experiments described in this work.
Technical description: This dataset captures the editing activity of Wikipedia users from 01.2002 until 07.2013. It comprises 47 CSV (“Comma Separated Values” format) files, one for each quarterly timeframe. Each file contains 117,523 lines (plus a header line), describing the editing activity of each of the users in our dataset, from the beginning of Wikipedia until the beginning of the timeframe corresponding to the given file. The columns in the CSV file (starting from the 6th column onwards) correspond to the basic and extended features, described in the Main Text, Sec. 4.1. For example, column AGRICULTURE in file “2003 07 dataset.csv” gives, for each editor, the total number of revisions relating to AGRICULTURE, performed from the beginning of Wikipedia until June 30th 2003. Column ALL gives the total number of revisions performed by an editor. Columns gender, education and religion correspond to the private traits of the editors. The content of these columns does not vary between the different temporal datasets. Fields in these columns take values when information is known about the users; otherwise they take the value NA (not available).
Complete editing dataset
Download: 1% Sample (∼495 MB): http://goo.gl/T47UVj; Complete (∼3.6 GB): http://goo.gl/2iLH7A
Description: Dataset containing information about Wikipedia pages, categories, users, user categories and revisions, in a relational database format.
Usage: This dataset is provided to allow extensions to the current work and/or to gain new insights from this data.
Technical description: The complete editing dataset is provided as a SQL relational database, with the table schema presented in Fig. 1. It contains three main tables:
• user – gives the names and total number of edits for each user in the dataset;
• page – provides information about the Wikipedia pages in the dataset (title, namespace, creation date etc.);
• revision – records atomic edits (revisions) made on page page_id by user user_id at time timestamp. The field text is always empty since we work with the structure of Wikipedia only and not with the text. This field exists here to allow future enhancements. With more than 188 million revisions, this table is the bulk of the dataset. For easier handling and speed of prototyping, we provide the “1% Sample” dataset, which contains only 1% of the revisions in the “Complete” dataset (the content of all the other tables is identical).
Tables and fields shown in Figure 1:
pagecategory: id, name, level
pagesubcategory: parent_id, child_id
pageincategory: page_id, category_id
page: id, title, namespace, creation_date, last_edit_date, total_edits
revision: id, page_id, user_id, timestamp, parentid, text
user: id, name, edits
userincategory: user_id, category_id
category: id, name, level
subcategory: parent_id, child_id
Figure 1. SQL table schema, showing the relations between the tables in the “complete editing dataset”. Table names are shown on blue background, primary keys are shown on red background. Arrows indicate foreign key relations, with the direction from a given field to its corresponding foreign key.
The remaining 6 tables describe the page/user categories and their relations. Table pagecategory gives information about page categories, which are constructed based on the category system described in Sec. 1.1. Similarly, table category provides information about user categories (described in Sec. 1.2), which serve as private traits. Table pageincategory (userincategory) gives the n-to-m relations between pages (users) and their respective categories. Table pagesubcategory (subcategory) encodes the hierarchic relation between categories.
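As an illustration of how the user-side tables fit together, the following is a minimal Python sketch using an in-memory SQLite database. It recreates only four of the nine tables, with the field names of Fig. 1; the column types, the meaning given to `level` (depth in the hierarchy), and all inserted rows are assumptions made for the example and are not taken from the released database. The query walks the `subcategory` relation upwards from the categories a user belongs to, which mirrors the bottom-up reconstruction of user categories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Field names follow Fig. 1; types and rows are hypothetical.
CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT, edits INTEGER);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT, level INTEGER);
CREATE TABLE subcategory (parent_id INTEGER, child_id INTEGER);
CREATE TABLE userincategory (user_id INTEGER, category_id INTEGER);

INSERT INTO "user" VALUES (1, 'example_editor', 42);
INSERT INTO category VALUES (10, 'Wikipedians by religion', 1);
INSERT INTO category VALUES (11, 'Christian Wikipedians', 2);
INSERT INTO subcategory VALUES (10, 11);
INSERT INTO userincategory VALUES (1, 11);
""")

# Start from the categories the user is directly in, then follow
# child -> parent links until no new ancestor is found.
ANCESTORS_SQL = """
WITH RECURSIVE trail(id) AS (
    SELECT category_id FROM userincategory WHERE user_id = ?
    UNION
    SELECT s.parent_id
    FROM subcategory AS s JOIN trail ON s.child_id = trail.id
)
SELECT c.name
FROM category AS c JOIN trail ON c.id = trail.id
ORDER BY c.level
"""

names = [row[0] for row in conn.execute(ANCESTORS_SQL, (1,))]
print(names)
```

The same pattern would apply to pages by swapping in pageincategory, pagesubcategory and pagecategory.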
2 ADDITIONAL DATA PROFILING GRAPHS
As indicated in the main article, we present hereafter additional graphs showing:
• Fig. 4 and 5: the decline of editorship and the rise of maintenance, through the number of total revisions, active editors and new editors. Thematic features all present very similar evolutions to those presented here and in the main article.
• Fig. 6 to 11: the evolution of editing patterns for gender (Fig. 6 and 7), education (Fig. 8 and 9) and religion (Fig. 10 and 11), for pairs (feature, private trait) covering all the basic and extended features. The number of revisions performed by editors during each quarterly timeframe on each feature is computed as a percentage of the total number of revisions (ALL), and the mean value over all users is presented.
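The per-timeframe percentages can be derived from the cumulative counts stored in the temporal CSV files. The following Python sketch shows one way to do so. The rows are synthetic, in-memory strings stand in for two consecutive timeframe files, the `user` column name and the column order are assumptions, and skipping users with no revisions in the quarter is a choice made for this sketch rather than something stated in the text.

```python
import csv
import io

# Synthetic stand-ins for two consecutive quarterly files (hypothetical rows).
EARLIER = """user,gender,education,religion,AGRICULTURE,ALL
alice,f,NA,NA,3,10
bob,NA,grads,christian,0,4
carol,f,PhD,NA,1,2
"""
LATER = """user,gender,education,religion,AGRICULTURE,ALL
alice,f,NA,NA,5,18
bob,NA,grads,christian,0,4
carol,f,PhD,NA,2,12
"""


def load_snapshot(text):
    """Parse one timeframe file into {user: row}."""
    return {row["user"]: row for row in csv.DictReader(io.StringIO(text))}


def quarterly_counts(earlier, later, feature):
    """Counts are cumulative, so activity within a quarter is the difference."""
    return {user: int(later[user][feature]) - int(earlier[user][feature])
            for user in later}


def mean_share_of_all(earlier, later, feature):
    """Mean over users of the feature's share of their revisions in the quarter."""
    feature_delta = quarterly_counts(earlier, later, feature)
    all_delta = quarterly_counts(earlier, later, "ALL")
    # Assumption: users with no revisions in the quarter are left out.
    shares = [feature_delta[u] / all_delta[u]
              for u in all_delta if all_delta[u] > 0]
    return sum(shares) / len(shares) if shares else 0.0


earlier = load_snapshot(EARLIER)
later = load_snapshot(LATER)
share = mean_share_of_all(earlier, later, "AGRICULTURE")
print(round(share, 3))
```

To reproduce a per-trait curve, the same computation would be restricted to the users sharing a given value of gender, education or religion.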
3 STATISTICAL TESTING OF INCREASING PRIVACY LOSS
In the Main Text Sec. 6.1, we show that Privacy Loss follows an increasing trend. The prediction performances, measured using the AUC ROC indicator (James et al., 2013), obtained for later timeframes are higher compared to those of earlier timeframes. This evolution is shown in Main Text Fig. 6 and completed in SI Fig. 13. In order to rigorously test this increasing trend, we perform a statistical testing analysis. For each timeframe, we train the Logistic Regression 20 times (cf. testing protocol described in Main Text Sec 4.2). We perform a 2-sample two-sided t test between the AUC ROC results obtained for the first and the last timeframe, for each binary predictor. The null hypothesis states that the measure for both timeframes has the same mean value (i.e. no Privacy Loss), whereas the alternative hypothesis states that the mean values for the two timeframes are different. Table 1 shows the p-values of the performed t tests, as well as the significance level: * significant (p < 0.05), ** very significant (p < 0.01) and *** highly significant (p < 0.001). With the exception of religion/jewish, the increase of prediction performance for all binary predictors is statistically significant.
Furthermore, we study how significant the increase of prediction performance is for the Fixed Population (FP) described in Main Text Sec. 6.1. The evolution of the AUC ROC measure, depicted in Main Text Fig. 8, is increasing in early timeframes and plateaus afterwards. By performing statistical testing, we show in Table 1 that even the loss of privacy due to the users’ own behavior is statistically highly significant (line “gender FP”).
Table 1. Hypothesis testing of the statistical significance of the increase of learning performance. For each trained predictor, we show the p-value and the significance level.
trait      class        p-value          Significance level
gender     gender       3.219 × 10^-14   ***
gender     gender FP    1.262 × 10^-4    ***
educat.    undergrads   5.882 × 10^-13   ***
educat.    grads        2.016 × 10^-6    ***
educat.    PhD          1.496 × 10^-2    *
religion   atheist      1.018 × 10^-3    **
religion   christian    6.916 × 10^-12   ***
religion   jewish       1.064 × 10^-1    (not significant)
religion   muslim       4.055 × 10^-12   ***
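The test described in this section can be sketched in a few lines of Python. The sketch below assumes the pooled-variance (Student) form of the two-sample t statistic, since the text does not say which variant was used, and the AUC values in the two samples are made up for illustration. Converting the statistic into a p-value requires the Student t distribution and is not reproduced here; a statistics library would normally supply it. The second function maps a p-value to the significance levels defined for Table 1.

```python
import math
import statistics


def two_sample_t(first, last):
    """Pooled-variance t statistic and degrees of freedom (assumed variant)."""
    n1, n2 = len(first), len(last)
    pooled = ((n1 - 1) * statistics.variance(first)
              + (n2 - 1) * statistics.variance(last)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (statistics.mean(last) - statistics.mean(first)) / std_err, n1 + n2 - 2


def significance_stars(p_value):
    """Significance levels as defined in the text accompanying Table 1."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


# Hypothetical AUC values from repeated training runs on two timeframes.
auc_first = [0.60, 0.62, 0.61, 0.59]
auc_last = [0.70, 0.71, 0.69, 0.72]
t_stat, dof = two_sample_t(auc_first, auc_last)
print(round(t_stat, 3), dof)
```

Applied to the p-values of Table 1, `significance_stars` returns three stars for gender, two for religion/atheist, one for education/PhD and none for religion/jewish.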
4 COMPUTING OTHER PERFORMANCE MEASURES: F-SCORE AND “EQUAL PRECISION AND RECALL”
In statistics, the receiver operating characteristic (ROC curve) is a graphical plot that illustrates the performance of a binary classifier system, while varying its discrimination threshold. When using normalized units, the area under the curve (the AUC measure) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ’positive’ ranks higher than ’negative’) (Fawcett, 2006). One of the advantages of the AUC measure, used in the Main Text to evaluate learning performances over each private trait, is that it does not require setting a decision threshold. Other measures, such as Accuracy, Precision, Recall or F-Measure, require establishing a decision boundary in order to decide if a given example belongs (or not) to a given class.
We devise the following protocol: for the predictions of a given model, we perform a line search for the decision threshold. For F-Measure, we search for the point which maximizes the measure. For the point of equal Precision and Recall (ePR), we choose the boundary for which Precision equals Recall. This measure is also known as the Precision-Recall breakeven point. Once a boundary has been chosen, the given prediction is scored using the value of the measure at that point (i.e. the F-Measure or Precision/Recall). According to the testing protocol described in the Main Text Sec 4.2, each classifier is trained 20 times. We calculate and report the mean value and the standard deviation.
Table 2 presents these statistics, computed for the latest available snapshot (July 2013), alongside each class prior (for gender, we present the prior of male). Three traits seem to be inferred particularly accurately: gender, religion/jewish and religion/muslim. The evolution over time of the equal Precision/Recall of each attribute is presented in Fig. 12. Two attributes present a noteworthy evolution (clearly outside the standard deviation denoted by error bars): gender can be predicted increasingly accurately over time, while religion/muslim decreases slightly, probably due to the increase of active self-declared muslim editors.
Table 2. Performance of prediction of each private trait in the last timeframe (07/2013). Mean and standard deviation (after the ± sign) of F-Measure and “point of equal Precision and Recall” (ePR) are calculated over 20 random stratified splits (each class prior is given in the “Class prior” column).
trait      class        Class prior   F-Measure       ePR
gender     gender (m)   0.719         0.840 ±0.001    0.791 ±0.006
educat.    undergrads   0.420         0.605 ±0.007    0.535 ±0.012
educat.    grads        0.493         0.701 ±0.001    0.607 ±0.008
educat.    PhD          0.086         0.239 ±0.018    0.175 ±0.030
religion   atheist      0.324         0.796 ±0.001    0.693 ±0.008
religion   christian    0.451         0.704 ±0.000    0.555 ±0.008
religion   jewish       0.107         0.952 ±0.000    0.913 ±0.002
religion   muslim       0.116         0.937 ±0.000    0.900 ±0.002
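The measures discussed in this section can be illustrated with a short Python sketch. It computes the AUC from its probabilistic definition (counting ties as one half, a common convention), and runs a line search over decision thresholds to find the maximal F-Measure and the point where Precision and Recall are closest. The scores and labels are toy values, and taking the observed scores as candidate thresholds is an assumption of this sketch; the text does not specify the search grid.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 1)
    fp = sum(1 for hit, y in zip(predicted, labels) if hit and y == 0)
    fn = sum(1 for hit, y in zip(predicted, labels) if not hit and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def line_search(scores, labels):
    """Return (best F-Measure, threshold) and (ePR value, threshold)."""
    best_f, epr, smallest_gap = (-1.0, None), (None, None), float("inf")
    for threshold in sorted(set(scores)):  # assumed candidate grid
        p, r = precision_recall(scores, labels, threshold)
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f[0]:
            best_f = (f, threshold)
        if abs(p - r) < smallest_gap:
            smallest_gap = abs(p - r)
            epr = ((p + r) / 2, threshold)
    return best_f, epr


# Toy predictions: classifier scores and true labels (1 = positive class).
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
LABELS = [1, 1, 0, 1, 0, 0]
best_f, epr = line_search(SCORES, LABELS)
print(auc(SCORES, LABELS), best_f, epr)
```

On the toy data the two criteria select different thresholds, which is why the two measures are reported separately.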
5 ADDITIONAL PREDICTION PERFORMANCE GRAPHS
• Figure 13: Editing Wikipedia discloses private information. AUC evolution on the remaining classes of private traits, apart from those presented in the Main Text.
• Figure 14: In addition to the education/undergrads information revealed about editors who retired after 01.2008 (shown in Main Text Fig. 10a), we also present the increase of prediction performance for religion/christian. No other binary classifier presents an increase of AUC score over time.
REFERENCES
Danescu-Niculescu-Mizil, C., Lee, L., Pang, B., and Kleinberg, J. (2012), Echoes of power: Language effects and power differences in social interaction, in World Wide Web, Proceedings of, WWW ’12, 699–708
Fawcett, T. (2006), An introduction to ROC analysis, Pattern Recognition Letters, 27, 8, 861–874
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning, volume 103 of Springer Texts in Statistics (Springer New York), doi:10.1007/978-1-4614-7138-7
[Figure 2: two graph visualizations. Panel (a) shows the first-level children of the user category “Wikipedians by location” (node labels include Wikipedians in Germany, Wikipedians in Canada, Wikipedians in Brazil and Wikipedians in the United Kingdom). Panel (b) shows the first-level children of the user category “Wikipedians by profession” (node labels include Wikipedian nurses, Wikipedian mathematicians, Wikipedian journalists and Wikipedian teachers).]
Figure 2. Visual representation of first level user categories that can be extracted from the Wikipedia user pages, relating to geographic location (a) and profession (b). Some categories are further subdivided into subcategories (not shown in this figure)
[Figure 3: graph visualization of the first-level children of the user category “Wikipedians by religion” (node labels include Christian Wikipedians, Muslim Wikipedians, Jewish Wikipedians, Atheist Wikipedians, Buddhist Wikipedians and Hindu Wikipedians).]
Figure 3. (cont. of Figure 2) Visual representation of user categories that can be extracted from the Wikipedia user pages, relating to editors’ religion.
[Figure 4: fifteen panels, (a) to (o), one per feature: INFRA, LAW, SCIENCE, TALK.U, AGRICULTURE, HEALTH, LANGUAGE, BELIEF, CONTENT, USER, ALL, TALK.C, WIKI, NATURE and HISTORY. Each panel is titled “<FEATURE> − Active/New editors and #revisions” and plots three series (Active editors, New editors, Total revisions) per quarterly timeframe, with two vertical axes: number of editors (in thousands) and number of revisions (in millions).]
Figure 4. The number of revisions added to Wikipedia is constantly decreasing, as well as the number of active editors and new editors. The evolution is shown broken down by feature.
[Figure 5: fifteen panels, (a) to (o), one per feature: TECHNOLOGY, POLITICS, ENVIRONMENT, PEOPLE, EDUCATION, CHRONOLOGY, BUSINESS, GEOGRAPHY, SOCIETY, MATHEMATICS, HUMANITIES, LIFE, ARTS, CULTURE and APPLIED.SCIENCES. Each panel is titled “<FEATURE> − Active/New editors and #revisions” and plots three series (Active editors, New editors, Total revisions) per quarterly timeframe, with two vertical axes: number of editors (in thousands) and number of revisions (in millions).]
Figure 5. (cont. of Fig. 4) The number of revisions added to Wikipedia is constantly decreasing, as well as the number of active editors and new editors. The evolution is shown broken down by feature.
Figure 6. Temporal evolution of mean basic and extended values of features, broken down by gender. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: ALL, CONTENT, TALK.C, USER, TALK.U, WIKI, INFRA, AGRICULTURE, NATURE, LAW, HEALTH, HISTORY, SCIENCE, LANGUAGE, BELIEF. Each panel title has the form "gender − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: f, m.]
Figure 7. (cont. Fig 6) Temporal evolution of mean basic and extended values of features, broken down by gender. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: MATHEMATICS, GEOGRAPHY, HUMANITIES, SOCIETY, BUSINESS, LIFE, POLITICS, EDUCATION, ARTS, TECHNOLOGY, CHRONOLOGY, CULTURE, ENVIRONMENT, PEOPLE, APPLIED.SCIENCES. Each panel title has the form "gender − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: f, m.]
Figure 8. Temporal evolution of mean basic and extended values of features, broken down by education. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: ALL, CONTENT, TALK.C, USER, TALK.U, WIKI, INFRA, AGRICULTURE, NATURE, LAW, HEALTH, HISTORY, SCIENCE, LANGUAGE, BELIEF. Each panel title has the form "education − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: grads, PhD, undergrads.]
Figure 9. (cont. Fig 8) Temporal evolution of mean basic and extended values of features, broken down by education. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: MATHEMATICS, GEOGRAPHY, HUMANITIES, SOCIETY, BUSINESS, LIFE, POLITICS, EDUCATION, ARTS, TECHNOLOGY, CHRONOLOGY, CULTURE, ENVIRONMENT, PEOPLE, APPLIED.SCIENCES. Each panel title has the form "education − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: grads, PhD, undergrads.]
Figure 10. Temporal evolution of mean basic and extended values of features, broken down by religion. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: ALL, CONTENT, TALK.C, USER, TALK.U, WIKI, INFRA, AGRICULTURE, NATURE, LAW, HEALTH, HISTORY, SCIENCE, LANGUAGE, BELIEF. Each panel title has the form "religion − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: atheist, christian, jewish, muslim.]
Figure 11. (cont. Fig 10) Temporal evolution of mean basic and extended values of features, broken down by religion. All features are expressed as percentages of ALL.
[Plots not reproduced. Panels (a) to (o), one per feature: MATHEMATICS, GEOGRAPHY, HUMANITIES, SOCIETY, BUSINESS, LIFE, POLITICS, EDUCATION, ARTS, TECHNOLOGY, CHRONOLOGY, CULTURE, ENVIRONMENT, PEOPLE, APPLIED.SCIENCES. Each panel title has the form "religion − FEATURE percentage of ALL". x-axis: semi-annual time frames (months 04 and 10), ending in 2013; y-axis: percentage out of all revisions; legend: atheist, christian, jewish, muslim.]
[Figure 12 panels (plots not recoverable from the extracted text), all on basic features: (a) Prediction of gender − m; (b) Prediction of religion − atheist; (c) Prediction of religion − christian; (d) Prediction of religion − jewish; (e) Prediction of religion − muslim; (f) Prediction of education − undergrads; (g) Prediction of education − grads; (h) Prediction of education − PhD. y-axis: EqualPrecisionRecall. x-axis: quarterly timeframes from 01−2007 to 10−2013.]
Figure 12. Evolution of prediction performance over time, measured using the Equal Precision Recall point. The lines represent the mean value and the error bars represent the standard deviation over 20 random splits. In order from (a) to (h), the panels show gender, religion/atheist, religion/christian, religion/jewish, religion/muslim, education/undergrads, education/grads and education/PhD. The values for the rightmost timeframe are given in Table 2.
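The measure plotted in Figure 12 can be illustrated with a minimal sketch. It assumes that the Equal Precision Recall point is the usual precision-recall break-even point: if exactly as many users are flagged as there are true positives, precision and recall coincide. The function names `equal_precision_recall` and `summarize`, the toy data, and the use of the sample standard deviation are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, stdev


def equal_precision_recall(labels, scores):
    """Precision (= recall) when the top-k scored users are flagged,
    with k equal to the number of positive labels."""
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank by decreasing score; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = sum(1 for _, y in ranked[:n_pos] if y)
    return true_pos / n_pos


def summarize(values):
    """Mean and sample standard deviation over repeated random splits
    (assumption: the paper does not state which deviation it uses)."""
    return mean(values), stdev(values)


# Toy example: 3 positives among 5 users (hypothetical data).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
print(equal_precision_recall(labels, scores))
```

In the setting of Figure 12, `summarize` would be applied to the 20 per-split values obtained for one timeframe, giving the plotted mean and error bar.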
[Figure 13 panels (plots not recoverable from the extracted text), all on basic features, y-axis: Area under ROC: (a) Prediction of education - grads; (b) Prediction of education - PhD; (c) Prediction of religion - atheist; (d) Prediction of religion - Christian; (e) Prediction of religion - Jewish. Linear trend annotations recovered from the panels, listed in the order in which they were extracted: f(x) = 0.001x + 0.595, R² = 0.909, p-val = 2.016 x 10^-6; f(x) = 0.001x + 0.607, R² = 0.734, p-val = 1.496 x 10^-2; f(x) = 0.001x + 0.503, R² = 0.508, p-val = 6.916 x 10^-12; f(x) = 0.001x + 0.555, R² = 0.702, p-val = 1.018 x 10^-3; f(x) = 0.001x + 0.542, R² = 0.642, p-val = 1.064 x 10^-1.]
Figure 13. Results on the remaining binary predictors: temporal evolution of the AUC, measuring privacy loss. Results of inferring education (graduates and PhD) and religion (atheist, christian and jewish) using binary predictors on the basic feature set.
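The trend annotations of the form f(x) = ax + b with an R² value in Figure 13 can be reproduced in spirit with an ordinary least-squares fit of the per-timeframe AUC. The sketch below assumes that x is the index of the timeframe (0, 1, 2, ...), which the extracted text does not confirm. The function name `linear_trend` and the synthetic AUC series are hypothetical. The p-value is omitted because it requires a t-distribution, which is not in the Python standard library (`scipy.stats.linregress` reports one).

```python
def linear_trend(auc_values):
    """Least-squares line auc = slope * x + intercept over x = 0, 1, 2, ...
    Returns (slope, intercept, r_squared)."""
    n = len(auc_values)
    if n < 2:
        raise ValueError("need at least two timeframes")
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(auc_values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, auc_values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, auc_values))
    ss_tot = sum((y - y_mean) ** 2 for y in auc_values)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return slope, intercept, r_squared


# Synthetic, noise-free series (not data from the paper).
series = [0.595 + 0.001 * i for i in range(10)]
slope, intercept, r2 = linear_trend(series)
print(round(slope, 4), round(intercept, 4), round(r2, 4))
```

A positive slope with a high R² corresponds to the steady increase in AUC that the figure reads as privacy loss over time.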
[Figure 14 panel (plot not recoverable from the extracted text): Prediction of religion - Christian - users inactive after 01/2008. y-axis: Area under ROC, with ticks from 0.495 to 0.520. Annotated regions: Period of activity, Period of inactivity.]
Figure 14. Increase of prediction performance for users retired after 01.2008 (religion/christian).