Bouillon: a wiki-wiki social web

Victor Grishchenko
Institute of Physics and Applied Mathematics, Ural State University
51 Lenina st., Yekaterinburg, Russia
[email protected]

Abstract. The Bouillon project implements the vision shorthanded as a “Social Web”. It is an extremely open collaboration environment that employs social links for the creation, filtering and dissemination of information. It is also an attempt to boost the wiki effect of knowledge crystallization. Currently, the project is in the public-testing stage, available at http://oc-co.org.

1 Introduction

1.1 General theory

Many limitations of current online environments have their roots in the problem of trust. The author’s general theory is that every information exchange environment has two key parameters: openness and (quality) control. These parameters have to be balanced. An over-controlled environment has problems accommodating new information and thus becomes poor. A too open environment becomes polluted. These problems are generic and exist independently of a particular medium (paper, voice or bytes). This simple theory is further illustrated with some examples. Although openness and control seem to be contradictory objectives, the author’s point is that more advanced environments combine more openness with more control. E.g. Wikipedia’s success compared to Britannica [13] may be explained by the former being much more open while keeping a level of control comparable to the latter’s. The objective of the Bouillon project is to create an online environment with a bit more openness and control than those currently existing.

1.2 On wiki scalability

The original WikiWikiWeb [2] ideas targeted minimum-effort collaboration leading to information “automagically” crystallizing. A wiki is supposed to be a single medium for reading, collaborative editing, discussions and conversations. In many cases, “crystallization” appeared to be a more effective way of putting information together than the “glue” that search engines provide for the WWW. A sign of recognition of the wiki way’s advantages is that a Wikipedia article is the first entry for many Google searches on appropriate topics (if you do not already have the habit of checking Wikipedia first).

Do wikis face any challenges? Or, what might be more open than a wiki? Every information environment has faced the scalability challenge; while some crises led to the environment’s decay (such as Usenet’s Eternal September), other crises were answered by a major breakthrough (such as Google’s PageRank). Wikis are increasingly popular; Wikipedia accommodates exponentially growing content without any visible degradation of its quality. Obviously, extreme openness (any visitor can edit) leads to regular abuse by spambots and vandals. Many of today’s smaller wikis have to employ user accounts and passwords, which shifts their utility closer to that of ordinary web pages. Wikipedia, as an Internet-scale project, has some advantages in this regard. Namely, Wikipedia may gather exhaustive statistics on strangers’ behavior, and its great attention resource gives it a unique ability to remove vandalism and spam by manual labor. Still, the scope of the project is limited to encyclopedic factual information. Working with personal opinions, experiences and discussions, commercial or local information, manuals and guides, etc. is beyond the possibilities of Wikipedia’s technology. So, wiki openness and scalability have visible borders.

2 State of the field

2.1 Social ∩ Web

One path for improving wiki scalability was proposed by the original wiki inventor, Ward Cunningham [12], in 1997. The “folk memory” approach assumes a more social, peer-to-peer way of creating, filtering and disseminating new material, much like “folk tales or folk songs are remembered and propagated within a culture”. Indeed, if it works in the real world, in real social networks, why wouldn’t it work automatically online? Another concept, the “Social Web”, although rather fuzzy, seems relevant to our case. Wikipedia defines the Social Web as “an open global distributed data sharing network” that “links people, organizations, and concepts”. One attempted step toward a practical implementation of the Social Web is the Augmented Social Network initiative [16]. The initiative put some effort into introducing online social interaction standards based on OASIS XDI, but produced no usable outcome. One more relevant ongoing effort is the NEPOMUK project (Networked Environment for Personalized, Ontology-based Management of Unified Knowledge [4]), which is a part of the Semantic Web movement. Wiki-related activities of NEPOMUK include the creation and deployment of semantically annotated wikis [17].

2.2 Topology

This work follows the folk memory vision of information propagation in a social network, so the topological features of social networks are extremely important here. What do we know about social topology?

One recent breakthrough in this area is associated with the theory of scale-free networks [8]. The most characteristic feature of scale-free networks is the power-law distribution of node degree, P(k) ∼ k^(−γ), 2 < γ < 3, where P(k) is the probability of a node’s degree being equal to k. Two results on scale-free networks are especially relevant in the context of this paper. The first result is that scale-free networks have no epidemic threshold. Less prevalent memes (genes, pathogens, rumors) do not die out exponentially fast, although they are present in smaller populations [9]. An immediate conclusion is that even less-demanded information will travel through a friend-to-friend network of such topology. To put it simply: long-tail content is welcome. Another result is that scale-free networks are ultrasmall: a scale-free network has a diameter of ∼ log log N, where N is the number of nodes. This fact is informally known as the six degrees of separation hypothesis, originally attributed to S. Milgram. The ultrasmall diameter guarantees that a network may undergo exponential growth without significant topology changes, i.e. it is ultimately scalable. Also, this gives us good information reachability guarantees, see Sec. 3.1.
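To make the power law tangible, the following minimal sketch (the author’s illustration, not from the paper; all class names and parameters are hypothetical) grows a Barabási–Albert style network by preferential attachment and prints a log-binned degree histogram, whose counts should fall off roughly as k^(−3):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch: preferential attachment yields a power-law
// degree distribution, P(k) ~ k^-3 for this classic construction.
public class ScaleFreeSketch {
    public static void main(String[] args) {
        int n = 100_000;   // network size
        int m = 3;         // edges attached by every new node
        Random rnd = new Random(42);
        // Every node id appears in `endpoints` once per incident edge,
        // so a uniform draw picks a node proportionally to its degree.
        List<Integer> endpoints = new ArrayList<>();
        int[] degree = new int[n];
        // Seed with a small clique of m+1 nodes.
        for (int i = 0; i <= m; i++)
            for (int j = 0; j < i; j++) {
                endpoints.add(i); endpoints.add(j);
                degree[i]++; degree[j]++;
            }
        for (int v = m + 1; v < n; v++) {
            int[] targets = new int[m]; // may repeat; fine for a sketch
            for (int e = 0; e < m; e++)
                targets[e] = endpoints.get(rnd.nextInt(endpoints.size()));
            for (int u : targets) {
                endpoints.add(u); endpoints.add(v);
                degree[u]++; degree[v]++;
            }
        }
        // Log-binned histogram: counts should drop roughly as k^-3.
        for (int k = m; k < 4096; k *= 2) {
            int count = 0;
            for (int d : degree) if (d >= k && d < 2 * k) count++;
            System.out.printf("degree in [%d, %d): %d nodes%n", k, 2 * k, count);
        }
    }
}
```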

2.3 Related research

Some recent advances are not directly related to the subject of this paper but heavily intersect with it, so they are worth mentioning briefly. Some previous work focused on DHT-style distributed wiki hosting [22], which changes the technical aspect while leaving the collaboration aspects of wiki(pedia) intact. Steps were taken by different parties to introduce some wiki interchange format, mostly for offline wiki synchronization, although the author is unaware of any widely accepted standard in this area. There are numerous commercial and open-source collaborative real-time editors letting a limited number of participants craft a common document. Notable products of this kind include SubEthaEdit [5] for Mac OS X, the open-source editor ACE [1] and others. A web-based collaborative editor, SynchroEdit [6], may also be used as a wiki editor. During the last five years, the Internet has seen explosive growth of online social networking sites, such as LinkedIn, LiveJournal or MySpace. Extensive work was undertaken to optimize peer-to-peer query flood [10,23,18]. That approach was originally used in the Gnutella P2P network [3], as well as in some later projects, and it is considered to be computationally inefficient. The main concern was the exponential growth of the number of queries relative to the number of users [21]. Bouillon’s objective is to turn this weakness into a strength, see Sec. 3.1. Sec. 3.2 specifically addresses the complexity issues of Bouillon’s use of query flood.

3 Bouillon

The Bouillon project aims to create a real-time WYSIWYG wiki without any distinction between reading and editing modes. The extreme openness is balanced by the employed information filtering and dissemination process: any changes and opinions propagate from friend to friend, so content is sieved by the social network. This is much along the lines of the Folk Memory and Social Web visions mentioned earlier.

As in any wiki, every Bouillon page is identified by its name. Every Bouillon page is a tree of pieces (think of paragraphs). Every piece has a partially ordered set of versions. To choose among different versions and to glue separate pieces into a tree, Bouillon employs opinions. Opinions can be of two kinds: either “a/b”, claiming relevance of piece b as a kid of piece a, or “a : v”, claiming relevance of a particular version v of piece a. Opinions can be both positive and negative. As mentioned, all information, opinions, requests and pieces are passed from friend to friend; the network has social topology. The links (contacts) formed are weighted with reputation in both directions. Reputations are supposed to be auto-calculated based on the past performance/compliance of one respective peer in the eyes of another. All opinions are weighted according to the reputation distance to the author. Opinion calculations are made according to a formal model based on fuzzy logic, which is explained in [15] and in more detail in [24].

To retrieve opinions from the social network, peers use requests of two types. A root request /a asks for anyone who knows about the piece a; a kid request a/ retrieves opinions of the a/b and a : v kinds. Correspondingly, opinion retrieval goes in two phases. The root request floods the vicinity of a peer to detect every promising direction where something is known about the target page. Such a flood covers just a sublinear number of nodes (compared to the overall size of the network). The following recursive retrieval is destined to promising directions only; as the process advances deeper into the tree, the list of promising directions shrinks, until the whole tree of reachable material is retrieved. After all relevant opinions are retrieved, the client assembles the page using the most recommended/fresh versions of accepted pieces and finally shows it to the user. Any further edits introduced by the user are committed as opinions and new versions of pieces. These edits immediately become available to the author’s social vicinity. On silent approval by any other reading peer, new changes propagate further through the social network.
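As a rough illustration of this data model (a sketch by the editor, not the actual Bouillon engine; all class and method names are hypothetical), the two opinion kinds and a reputation-weighted choice of a piece’s version might look as follows:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the opinion model described above: "a/b" glues
// piece b under piece a, "a:v" endorses version v of piece a; opinions
// may be positive or negative and are weighted by the reputation the
// local peer assigns to their author.
final class Opinion {
    enum Kind { KID, VERSION }     // "a/b" vs "a:v"
    final Kind kind;
    final String piece;            // piece a
    final String target;           // kid piece b, or version v
    final boolean positive;       // opinions may also be negative
    final String author;
    Opinion(Kind kind, String piece, String target, boolean positive, String author) {
        this.kind = kind; this.piece = piece; this.target = target;
        this.positive = positive; this.author = author;
    }
}

class PageAssembler {
    // Toy stand-in for the fuzzy-logic reputation model of [15,24]:
    // a single weight per author, as seen from the local peer.
    private final Map<String, Double> reputation = new HashMap<>();
    void setReputation(String author, double w) { reputation.put(author, w); }

    // Pick the winning version of one piece by reputation-weighted vote.
    String chooseVersion(String piece, List<Opinion> opinions) {
        Map<String, Double> score = new HashMap<>();
        for (Opinion o : opinions) {
            if (o.kind != Opinion.Kind.VERSION || !o.piece.equals(piece)) continue;
            double w = reputation.getOrDefault(o.author, 0.0);
            score.merge(o.target, o.positive ? w : -w, Double::sum);
        }
        return score.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }
}
```

A real engine would, per the text above, also weight opinions by the reputation distance along the social path and prefer fresher versions; the sketch collapses all of that into a single per-author weight.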

3.1 Reachability

It is not obvious whether information will propagate fast in such a network or whether the process will get stuck, e.g. the network may separate into islands of incompatible changes. There is a simple topological result:

Lemma 1. IF an average peer is able to serve its sublinear neighborhood (i.e. N^a nodes, assuming N to be the size of the network) THEN any information or change that has propagated to a corresponding sublinear proportion of nodes (N^(1−a)) is broadly accessible (i.e. the majority of peers may retrieve it by requests).

Proof. (For Erdős–Rényi random graphs.) Both the nodes storing the information and the nodes covered by the request are assumed to be random node subsets of size O(N^(1−a)) and O(N^a) respectively. Thus, the size of the intersection of the two sets is of the order of pqN = (N^(1−a)/N) × (N^a/N) × N = 1, where p and q are the probabilities for a node to belong to the respective sets. (Actually, we get a Bernoulli trial with success probability pq.) A mean intersection size of 1 may seem unreliable; this is easily fixed by multiplying the size of either set by some constant factor c. Deviations from the mean follow the binomial distribution, so this gives us success (the request meets the information) in most of the cases. □

For scale-free graphs, the situation is even better. In a scale-free graph, highly-connected nodes (hubs) appear early in a vicinity ball growing from some center [11]. So, both the set of nodes that store the information and the set covered by the request are supposed to intersect heavily with a third set, that of the hubs (which also has sublinear size [14]). This boosts the effect described in the previous lemma. What about information that has gained no sublinear “prevalence”? First, most information is of local value and is not supposed to spread globally. People’s interests are supposed to correlate with social links, so such local information will likely reach all of its audience without disturbing the rest. Because of the absence of an epidemic threshold (Sec. 2.2), it is generally expected that no information will die out if it is of some interest to anyone. The six degrees of separation hypothesis contributes another optimistic expectation: if a vicinity ball covered by requests has a radius of two steps (degrees), then a piece of information might ideally be relayed from any point of humankind to any other by two intermediate evaluations.
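The lemma is easy to check numerically. The sketch below (an illustration under the lemma’s assumptions, with N = 10^6, a = 0.5 and safety factor c = 4 chosen by the editor) draws both random subsets repeatedly and counts how often the request set meets the storage set; with expected intersection size c = 4, the miss rate should be near e^(−4) ≈ 2%:

```java
import java.util.BitSet;
import java.util.Random;

// Monte Carlo check of Lemma 1: a request covering N^a random nodes
// almost always intersects a storage set of c * N^(1-a) random nodes,
// since the expected intersection size is pqN = c.
public class ReachabilityCheck {
    public static void main(String[] args) {
        int N = 1_000_000;
        double a = 0.5;
        int c = 4;                                    // safety factor
        int stored = (int) (c * Math.pow(N, 1 - a));  // nodes holding the data
        int covered = (int) Math.pow(N, a);           // nodes the flood reaches
        Random rnd = new Random(1);
        int hits = 0, trials = 1000;
        for (int t = 0; t < trials; t++) {
            BitSet storage = randomSubset(N, stored, rnd);
            BitSet request = randomSubset(N, covered, rnd);
            request.and(storage);                     // intersection
            if (!request.isEmpty()) hits++;
        }
        System.out.printf("request met the data in %d of %d trials%n", hits, trials);
    }

    static BitSet randomSubset(int n, int size, Random rnd) {
        BitSet set = new BitSet(n);
        int count = 0;
        while (count < size) {
            int i = rnd.nextInt(n);
            if (!set.get(i)) { set.set(i); count++; }
        }
        return set;
    }
}
```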

3.2 Computational load

The Bouillon protocol has some resource-hungry features. First, the initial root request (vicinity flood) must cover some significant sublinear proportion of nodes. Second, a separate request corresponds to each piece of a page. Third, each request has to be refreshed every 50 seconds to support the real-time change notification feature. Let us estimate the bandwidth requirements for a peer in a network of 10^6 nodes, assuming a node vicinity of 10^3 nodes and an average message size of 100–200 bytes, 10–20 bytes compressed. (As tokens stream over long-living TCP connections, network overhead is skipped.) The average page size is 100 pieces and all peers are simultaneously reading different pages (no aggregation is possible). All peers reside at different hosts, so no traffic is local. In this worst-case scenario, an average node has to process on the order of 10^3 × 100 requests every 50 seconds, i.e. 2 × 10^3 messages per second or 4 × 10^4 bytes per second. Finally, we get 320 Kbit/s, less than a typical BitTorrent download speed.

We may also suppose that the high clustering coefficient [19,20] of social networks leads to unnecessary message duplication. (The clustering coefficient is the probability of two random friends of a random node being friends with each other.) Let the clustering coefficient be as high as 0.9. Simplistically, this leads to the effect of 90% of requests being passed to a node that already has a copy, so to cover 10^3 nodes, 10^4 request copies are needed. Thus, the pessimistically estimated bandwidth consumption jumps to 3.2 Mbit/s, which still fits into an average ADSL connection. (Needless to say, some datacenter-based deployment variants are always possible, users do not spend 24 hours a day online, some users read the same pages, some peers know nothing about a page, so they get no requests except the root request; also, the clustering effect is less relevant for the outer layer of a vicinity ball, which holds most of the ball’s mass, etc.)
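The arithmetic can be redone mechanically; the snippet below simply repeats the paper’s assumed constants (vicinity 10^3, 100 pieces per page, 50 s refresh, 20-byte compressed messages) and the 0.9-clustering pessimism:

```java
// Back-of-the-envelope reproduction of the estimates above; all the
// constants are the paper's assumptions, not measured values.
public class LoadEstimate {
    public static void main(String[] args) {
        int vicinity = 1_000;       // nodes covered by the root request
        int piecesPerPage = 100;    // one request per piece
        int refreshSec = 50;        // request refresh period, seconds
        int msgBytes = 20;          // compressed message size, bytes

        double msgsPerSec = (double) vicinity * piecesPerPage / refreshSec;
        double kbitPerSec = msgsPerSec * msgBytes * 8 / 1000;
        System.out.printf("%.0f msg/s, %.0f Kbit/s%n", msgsPerSec, kbitPerSec);

        // Clustering coefficient 0.9: 90% of request copies hit nodes that
        // already have one, so ~10x copies are needed to cover the vicinity.
        System.out.printf("with clustering 0.9: %.1f Mbit/s%n",
                          kbitPerSec * 10 / 1000);
    }
}
```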

4 Conclusion

A wiki-wiki prototype of the Social Web was implemented in 1200+ lines of Java code for the engine plus 200 lines of original JavaScript code for the client (WYSIWYG editing is implemented by TinyMCE [7]). The technology is supposed to scale very well, combining extreme openness and reasonable quality control for massively distributed collaboration. The prototype is always available at http://oc-co.org.

References

1. ACE collaborative text editor. http://ace.sf.net.
2. Article on wikis at c2.com. http://c2.com/cgi/wiki?WikiWikiWeb.
3. The Gnutella protocol specification v0.4. http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf.
4. NEPOMUK, the social semantic desktop project homepage. http://nepomuk.semanticdesktop.org/.
5. SubEthaEdit product page. http://www.codingmonkeys.de/subethaedit/.
6. SynchroEdit collaborative editor product page. http://www.synchroedit.com/.
7. TinyMCE JavaScript content editor. http://tinymce.moxiecode.com.
8. Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, October 1999.
9. Marián Boguñá, Romualdo Pastor-Satorras, and Alessandro Vespignani. Absence of epidemic threshold in scale-free networks with connectivity correlations. http://www.citebase.org/abstract?id=oai:arXiv.org:cond-mat/0208163, 2002.
10. Yatin Chawathe, Sylvia Ratnasamy, Lee Breslau, Nick Lanham, and Scott Shenker. Making Gnutella-like P2P systems scalable, 2003.
11. Reuven Cohen and Shlomo Havlin. Scale-free networks are ultrasmall. Phys. Rev. Lett., 90:058701, 2003.
12. Ward Cunningham. Folk memory: A minimalist architecture for adaptive federation of object servers. http://c2.com/doc/FolkMemory.pdf, 1997.
13. Jim Giles. Internet encyclopaedias go head to head. Nature, 438:900–901, Dec 15 2005.
14. Victor Grishchenko. Computational complexity of one reputation metric. In Proceedings of the SECURECOMM SECOVAL workshop, Athens, Greece, 2005.
15. Viktor S. Grishchenko. Redefining web-of-trust: reputation, recommendations, responsibility and trust among peers. In Proceedings of the First Workshop on Friend of a Friend, Social Networking and the Semantic Web, pages 75–84. National University of Ireland, Galway, Sep 2004.
16. Ken Jordan, Jan Hauser, and Steven Foster. The augmented social network: Building identity and trust into the next-generation internet. http://www.firstmonday.dk/issues/issue8_8/jordan/, 2003.
17. M. Kotelnikov, A. Polonsky, M. Kiesel, M. Völkel, M. Sogrin, et al. NEPOMUK: interactive semantic wikis. http://nepomuk.semanticdesktop.org/xwiki/bin/view/Main1/D1-1, 2007.
18. Tsungnan Lin, Hsinping Wang, and Jianming Wang. Search performance analysis and robust search algorithm in unstructured peer-to-peer networks. Pages 346–354, 2004.
19. M. E. J. Newman. Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E, 64(1), 2001.
20. M. E. J. Newman. Scientific collaboration networks. II. Clustering and preferential attachment in growing networks. Phys. Rev. E, 64(2), 2001.
21. Jordan Ritter. Why Gnutella can’t scale. No, really. http://www.darkridge.com/~jpr5/doc/gnutella.html, 2001.
22. Guido Urdaneta, Guillaume Pierre, and Maarten van Steen. A decentralized wiki engine for collaborative Wikipedia hosting. In Proceedings of the 3rd International Conference on Web Information Systems and Technologies, March 2007. http://www.globule.org/publi/DWECWH_webist2007.html.
23. B. Yang and H. Garcia-Molina. Improving search in peer-to-peer networks. Pages 5–14, 2002.
24. Viktor Grishchenko. A calculus of opinions (Исчисление мнений). Izvestiya of the Ural State University, Computer Science and Information Technologies series, (43):139–153, 2006. In Russian.