Measuring Wikipedia to understand the change and bias of information

Echeverria-Arellano, A.P.º, Quartino-Ibarra, M.L.º, Yañez-Cordova, S.º and Sainz-de Agüero, J.J.º*

ºInstituto Rosedal Lomas. L.C., Rosedal 50, Lomas de Chapultepec, Miguel Hidalgo, CDMX, México. CP 11000
*Corresponding author: [email protected]

Abstract

We obtained data on the size (bytes) and the number of editors involved for 22 different Wikipedia pages over their last 50 edits. We examined this data in two main analyses: a) one focused on the behavior of information through time, and b) one focused on the dependence between variables regarding size and editing. We found evidence of bias related to gender, language, culture, and whether a personality is dead or alive. These results show the impact of information flux and its consequences on digital citizenship.

Key words: Wikipedia, data analysis, social bias

Introduction

Wikipedia is the most consulted multilingual, web-based, free-content and freely editable encyclopedia worldwide. It was founded by Jimmy Donal Wales and Larry Sanger in 2001. Wikipedia costs around US$81.9 million, which is covered by Wikimedia donations. It is built in 298 languages and grows through the work of Wikipedians, who are volunteer editors who write and edit articles. Any person can become an editor by making a change or edit that improves the article. There are currently 71,592,125 Wikipedia accounts, of which 300,765 are actively editing. For that reason, many people think that Wikipedia is not a reliable site: they believe it may contain false information or, simply out of ignorance of how Wikipedia works, they assume that it is full of bias and untrustworthy information because it is not a “protected” information source. But since it was created, Wikipedia

This work is licensed under the Creative Commons Attribution -NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.


has used cascading protection, which prevents vandalism to particularly visible pages, such as the Main Page and a few very highly used templates. In addition, the edits that a user makes can only be accepted by the administrators. Because of this, analyzing the historical editing data of Wikipedia is relevant for the following reasons:

● Wikipedia is commonly the first option for many people when acquiring knowledge.
● It is also the encyclopedia with the most authors and editors in the world and in history, which makes it very diverse and interesting for data analysis.
● It is also the most diverse encyclopedia regarding languages, with more than 260 languages, allowing almost every person to find information in his or her main language.
● Because of its structure and organization, it is a very efficient website with specific information on every topic you search for.

Chart 1. Demographic information about Wikipedia editors

Editors' gender: 84% male, 16% female
Editors' nationality: 20% USA, 12% Germany, 7% Russia
Editors per language Wikipedia: 76% English, 20% German, 12% Spanish
Readers per language Wikipedia: 49% English, 12% German, 6% Spanish
Editors' age distribution (years old): 13% are 17 or younger, 14% are 18-21, 26% are 22-29, 19% are 30-39, 28% are 40 or older

Hypothesis

If Wikipedia is the most consulted multilingual, web-based, free-content and freely editable encyclopedia worldwide, and therefore a digital platform where a wide range of diverse world citizens actively participate, then the page size (bytes), the type of edit, and the number of editors will behave the same in any language, for dead or alive personalities, and without gender bias (Wikipages of female or male personalities).

Method and Materials

We analyzed Wikipedia pages that present a variety of characteristics: different topics (celebrities, royalty, historical events, science, arts), gender, language (English and Spanish) and dead or alive personalities. For each page we looked at its size, the number of editors involved and the time at which the changes happened.

We obtained 1,100 data points from 22 Wikipedia pages: the page size (bytes), the type of edit, and the name of the editor for the last 50 edits of each page, made from the 22nd of March to the 31st of May. We created 9 graphs to analyze the data. In Graphs 1 to 4 we analyzed the behavior of the pages through the size of the page (bytes) and time (edit number). In Graphs 5 to 9 we analyzed the bias we found regarding language, gender, and dead or alive personalities. We obtained the standard deviation (SD) of all the Wikipages with the following equation:

Equation 1. $SD_n = s_n - \bar{s}$

where $s_n$ is the size of the page (bytes) at edit $n$ and $\bar{s}$ is the average size of the page over the edits analyzed. As defined here, SD is the signed deviation of each edit from the page's mean size.
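As a minimal sketch of Equation 1, the deviation of each edit from the page's mean size could be computed as follows. The function name and the sample sizes are hypothetical illustrations and are not data from this study.

```python
from statistics import mean


def deviations_from_mean(sizes):
    """Return, for each edit, the page size minus the mean size (bytes)."""
    avg = mean(sizes)
    return [size - avg for size in sizes]


# Hypothetical page sizes in bytes, oldest edit first (not study data).
example_sizes = [100_000, 100_500, 99_500, 100_000]
print(deviations_from_mean(example_sizes))
```

Because every value is measured against the same mean, the deviations of one page always add up to zero, so the graphs built from them show how far each edit strays from the page's typical size rather than the size itself.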

For this analysis we worked with Google Drive: Google Docs 1.2018.24202 for writing the paper and Google Sheets 1.2018.24204 for building the charts. We built the graphs in Microsoft Excel 15.3.
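The paper does not say how the edit histories were gathered, so the following is only one possible way to collect the same kind of data (date, editor and page size for the last 50 edits), sketched with the public MediaWiki Action API. The function names and the User-Agent string are hypothetical, the endpoint shown is the English Wikipedia (a Spanish page would use es.wikipedia.org), and the sketch does not restrict the edits to the date range used in the study.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint


def build_history_url(title, limit=50):
    """Build a MediaWiki API URL asking for the latest `limit` revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|size|comment",
        "rvlimit": str(limit),
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_revisions(response):
    """Extract (timestamp, user, size) rows from a decoded API response."""
    page = response["query"]["pages"][0]
    return [
        (rev.get("timestamp"), rev.get("user"), rev.get("size"))
        for rev in page.get("revisions", [])
    ]


def fetch_history(title, limit=50):
    """Download and parse the revision history (requires network access)."""
    request = urllib.request.Request(
        build_history_url(title, limit),
        headers={"User-Agent": "wikipage-size-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as reply:
        return parse_revisions(json.load(reply))
```

Calling `fetch_history("Cubism")` would return up to 50 rows, newest edit first, which could then be pasted into a spreadsheet like the one used for the charts.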

Results

Chart 2. Summary of the data collected for this analysis.

Number of Wikipedia pages analyzed: 22
Type of data obtained from the Wikipedia pages: edited section or action, date of edit, editor’s name, and size of page (bytes)
Categories assigned to the Wikipedia pages: dead or alive, language, gender, celebrities; personality, historical time, culture, science, art movement, dead and alive queens and kings, celebrities
Number of edits used for the analysis: 50 per page
Amount of data analyzed: 1,100

Data Analysis

We created 9 graphs analyzing different variables: time, size of the page in bytes, gender, language, dead or alive personalities, and number of edits.

Analysis of Behavior: changes through time

In Graph 1 we can recognize three main categories: a) the Extraordinary Wikipages (black), which are the Wikipages with the biggest size (from 170,000 to 250,000 bytes); b) the Regular group (blue), which includes the Wikipedia pages of regular size (from 70,000 to 140,000 bytes); and c) the Lower group (red), which includes the Wikipages with the lowest size (from 10,000 to 50,000 bytes) (see Graph 1).

The Extraordinary group is composed of Diana, Princess of Wales, World War II and Kate Winslet. The reasons for these huge pages might be the following: World War II was the largest armed conflict of the 20th century; Diana, Princess of Wales had a controversial death involving the British Royal Family and left a life legacy; and, in the case of the actress Kate Winslet, her movie roles and award nominations, in films like Titanic (1997), which won 11 Oscar awards, could be a reason for having a Wikipedia page bigger than 150,000 bytes (see Graph 1).

By obtaining the standard deviation with Equation 1 (see Method and Materials), in Graph 2 the Wikipedia pages are divided into two groups. The first, the Normal group, includes the pages that stay close to the mean, with constant changes that go no higher than 2,000 bytes above the mean and no lower than 2,000 bytes below it. The second, the Aleatory group, includes the Wikipedia pages that have a random and drastic change, departing from the mean by more than 2,000 bytes. The Aleatory group is composed of the following Wikipedia pages: Kate Winslet, Cubism, Malala (English), Moctezuma II, Diana, Princess of Wales, James Franco and Black Hole.

In Graph 3 we visualized only the Aleatory group. These pages are considered aleatory because they have a drastic change of behavior, meaning an increase or decrease in size of more than 2,000 bytes from the mean (see Graph 3). Cubism is an interesting case: it gets smaller until it stops at 2,000 bytes below the mean, then keeps a constant size until the end of the graph, when it goes down again to 6,000 bytes below the mean. Finally, it recovers to 2,000 bytes below the mean. The page of Moctezuma II, on the other hand, stays at the mean, but in one edit it goes down to 8,000 bytes below the mean, and in the next edit it recovers its size. Finally, at edit 45 it starts increasing away from the mean. We found no evidence from the physical world that contributes to the explanation of these behaviors (see Graph 3).
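The two groupings described above (by page size and by deviation from the mean) can be expressed as simple rules. The sketch below is only an illustration of those rules, using the byte ranges and the 2,000-byte cut-off given in the text; the function names are hypothetical, and sizes that fall between the stated ranges are left unclassified because the text does not assign them to a group.

```python
from statistics import mean

THRESHOLD_BYTES = 2_000  # cut-off used in the text for a "drastic" change


def classify_page(sizes, threshold=THRESHOLD_BYTES):
    """Label a page 'Aleatory' if any edit strays more than `threshold`
    bytes from the page's mean size, otherwise 'Normal'."""
    avg = mean(sizes)
    if any(abs(size - avg) > threshold for size in sizes):
        return "Aleatory"
    return "Normal"


def size_group(size):
    """Map a page size in bytes to the groups of Graph 1 (None if in a gap)."""
    if 170_000 <= size <= 250_000:
        return "Extraordinary"
    if 70_000 <= size <= 140_000:
        return "Regular"
    if 10_000 <= size <= 50_000:
        return "Lower"
    return None
```

For example, a page whose size jumps once by 10,000 bytes would be labelled Aleatory even if every other edit is tiny, which matches the single-edit drop and recovery described for Moctezuma II.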

Graph 1. Relationship between time and size of the page (bytes). Vertical axis: Size of page (bytes). Groups shown: Extraordinary Wikipages, Regular group, Lower group.

Graph 2. Change of standard deviation through time. Vertical axis: Standard deviation (bytes). Horizontal axis: Time (number of editions). Labelled group: Normal.

Graph 3. Relationship between the standard deviation and time, with only outstanding data. Vertical axis: Standard deviation. Horizontal axis: Time (number of editions). Labelled pages: Kate Winslet, Malala (en), Diana, Princess of Wales, James Franco, Black Hole.

Graph 4. Summation of bytes during a period of time, with only the normal data. Vertical axis: Summation of bytes.

We were not able to find evidence in the physical world of why these changes happened, which suggests that they depend on the dynamics of the editors and of Wikipedia itself, that is, dynamics inside the encyclopedia and the digital world.

Finally, in Graph 4 we view the summation of the sizes (bytes) of the Wikipages that have a regular behavior in the standard deviation (see Group 1, Graph 2). This graph shows a change in size with no definite trajectory, suggesting that the changes in size of the regular pages do not reflect any physical world effect.

Analysis of Bias: relationship between variables

In Graph 5 we recognize three groups with different behaviors: a) the first group is composed of two Wikipages, Black Hole and Marilyn Monroe, which have few editors and a medium-size page; b) the second group involves 4 Wikipages, World War II (English), Segunda Guerra Mundial (Spanish), Diana, Princess of Wales and Kate Winslet, which have the biggest page size of all three groups and a medium or high number of editors (more than 15 and fewer than 30 editors); and c) the third group is composed of the majority of the Wikipages, with a medium to high number of editors (15 to 35 editors) and a small to medium page size. The topics of the Wikipages of the first group are science and celebrities; we find no explanation for this joint behavior. The behavior of the second group probably reflects the historical value or popular culture of the events or personalities the Wikipage is about (see the analysis of Graph 1 for more details). The third group probably reflects the way Wikipedia pages are constructed: from 15 to 35 editors involved and a size no bigger than 130,000 bytes (see Graph 5).

By comparing languages in Graph 6, we recognize that English Wikipedia pages have more information than the Spanish ones. Most English Wikipages are more than 50,000 bytes bigger than the Spanish ones. This difference is probably related to the demographics of Wikipedia: 20% of the Wikipedia editors are from the USA, English is the main language of 76% of the editors, and 49% of users read Wikipedia in English. There is only one exception: the Culture of Mexico Wikipedia page (English), which groups with the Spanish Wikipedia pages. This suggests a cultural bias reflected in which type of information is consulted and edited, particularly information related to Mexico, a Spanish-speaking country (see Graph 6).

Apparently editors have no preference between dead or alive personalities regarding the size of the page (bytes) through time (edit number) (see Graph 7). Even though the size of the Wikipedia pages of the personalities analyzed varies, there is no evident pattern that suggests any bias regarding the size of the Wikipedia page.

Graph 8 shows that there are more editors in the pages of people who are alive and fewer in the pages of people who are dead. We do not have enough information to establish a clear bias, but we can suggest a pattern. By dividing the graph into three regions, left (fewer than 25 editors), center, and right (more than 30 editors), we found: a) the distribution of alive

Graph 5. Relationship between the size and the amount of editors. Vertical axis: Size of page (bytes). Horizontal axis: Number of editors.

Graph 6. Relationship between time and size of the page regarding language (Spanish and English). Vertical axis: Size of page (bytes). Horizontal axis: Time (number of edition). Groups shown: Language: English, Language: Spanish.

Graph 7. Relationship between time and size of the page regarding dead and alive personalities. Vertical axis: Size of page (bytes). Horizontal axis: Time (number of edition). Groups shown: Alive personalities, Dead personalities.

Graph 8. Relationship between the number of editors and size (bytes) regarding dead and alive personalities. Vertical axis: Size of page (bytes). Horizontal axis: Number of editors.

Graph 9. Relationship between the number of editors and size (bytes) regarding gender. Vertical axis: Size of page (bytes). Horizontal axis: Number of editors.


personalities’ Wikipages lies in the center and right regions, meaning most of them have more than 25 editors; b) regarding the distribution of dead personalities, we found the majority of Wikipages in the left region (fewer than 25 editors). This pattern probably emerges because new information keeps being created about a living person, so the page changes with the news, whereas this does not happen with dead personalities. Their Wikipedia pages are therefore not updated as much as those of living people, resulting in fewer editors involved. This result does not contradict the analysis of Graph 7, which suggests no bias regarding dead or alive personalities and their page size. In this analysis, the bias concerns how many editors are involved in the construction of the Wikipedia page; it apparently does not affect the size of the page, although it might affect its quality.

To close the bias analysis, we created Graph 9, which suggests a gender bias. The Wikipages about women are located from the left to the center region and the Wikipages about men are located in the right region. This pattern can be explained by the percentage of male Wikipedia editors (84% are men; see Chart 1). There are two exceptions to this distribution: Gal Gadot and Moctezuma II. Gal Gadot's page probably does not behave the same way because she is considered very attractive, she recently won an MTV Movie Award, and she is the only non-American woman who has played Wonder Woman. In the case of Moctezuma II, it may be that not many people are interested in Mexico's history, that he is an emperor associated with another language, and that he is already dead, so the information is not contemporary. We can conclude that our world is still a place where sexism is present, and we have to change that to have a world with equality.

Conclusions

1. There are three groups regarding page size: the Extraordinary group (from 170,000 to 250,000 bytes), the Regular group (from 70,000 to 140,000 bytes), and the Lower group (from 10,000 to 50,000 bytes) (see Graph 1).
2. The top 3 Wikipages have a big size because of their historic or popular value (see Graph 1).
3. The Wikipedia pages have 2 different behaviors regarding changes from the mean: the ones that stay close to the mean (Normal group), and the ones that have a random and big change (Aleatory group) (see Graph 2 and Graph 3).
4. The changes in the Aleatory group have no evident explanation from changes or influences of the physical world (see Graph 3).
5. The Wikipages that are categorized as normal due to their changes from the mean (Normal group) have no definite pattern in their change in size, also suggesting that the physical world does not influence their behavior (see Graph 4).
6. By comparing the size and number of editors we recognize three groups (see Graph 5):
• Group 1: Has few editors and a medium-size page. There is no evidence that explains this behavior.
• Group 2: Has the biggest Wikipage size of all three groups and a medium or high number of editors (more than 15 and fewer than 30 editors). It reflects the historical value or popular culture of the events or personalities the Wikipage is about.
• Group 3: Composed of the majority of the Wikipages, with a medium to high number of editors (15 to 35 editors) and a small to medium page size. These pages reflect the way Wikipedia pages are constructed.
7. Apparently there is a language bias related to the users and editors of Wikipedia: 20% of the Wikipedia editors are from the USA, English is the main language of 76% of the editors, and 49% of users read Wikipedia in English (see Graph 6).
8. Apparently there might be a cultural bias regarding Spanish-speaking cultures: “The Culture of Mexico” (Wikipage in English) has the same size as the Wikipages in Spanish (see Graph 6 and Graph 9).
9. Apparently Wikipedia does not have any preference between dead and alive personalities regarding page size: there is no pattern or evidence of bias in the size of the page through time (see Graph 7). However, we do find evidence that there might be a bias between dead or alive personalities in the number of editors involved in the Wikipage: more editors are involved in the Wikipages of living personalities (see Graph 8).

10. The Wikipages of female personalities tend to have fewer editors involved than male ones (see Graph 9).

Reflection

Wikipedia is a worldwide encyclopedia with its own protection structure that allows it to be a reliable source. Even so, it still shows evidence of sexism and of cultural and historical bias, which requires the active effort of every editor and user to suppress. Nevertheless, its capacity to constantly change and evolve is its primary characteristic, and it will ensure its transformation into the biggest encyclopedia in history without gender, cultural or any other type of discrimination.

Acknowledgements

We want to thank Ana Marinez-Parente Doredo for suggesting a standard deviation analysis for this project, and J. Jeronimo Sainz de Agüero for the revisions and advice at every step of this research. We also want to thank Antonio Echeverria Trujillo, Teresa Arellano Castañeda, Juan Quartino, Alejandra Ibarra, Carlos Yáñes Alegría, and Adriana Cordova Jaimes for all the financial support. We thank Héctor Ordóñez for allowing us to use the Innovation Hub at Instituto Rosedal Lomas for the development of this project, and for helping us with any technical problems we had. We would also like to thank Patricia Rodiguez Sada, Dulce Terrazas and Vanessa Ahedo for the support, advice, guidance and logistics.

