Exploring Activeness of Users in QA Forums - IEEE Computer Society

Exploring Activeness of Users in QA Forums

Vibha Singhal Sinha, Senthil Mani, and Monika Gupta
IBM Research - New Delhi, India
{vibha.sinha,sentmani,monikgup}@in.ibm.com

Abstract—The success of a Q&A forum depends on the volume of its content (questions and answers) and on the quality of that content (whether the questions asked are relevant, the answers provided are correct, etc.). Community participation is essential to create and curate content. Since their inception in 2008, Stack Exchange based forums have engaged a large number of users to create a rich repository of good-quality questions and answers. In this paper, we investigate the “activeness” of users in the Stack Exchange network, particularly from the perspective of content creation. We also attempt to measure how the forums’ incentive mechanism has enabled users’ activeness. Further, we investigate how users have diffused to other parts of the Stack Exchange network over time, hence bootstrapping new forums.

I. INTRODUCTION

Ever since it started in 2008, Stack Overflow¹ has become one of the most widely used question and answer sites for programmers. The success of Stack Overflow motivated its founders to start Stack Exchange², a network of question and answer sites on diverse topics. Based on the common Stack Exchange framework, this network today covers 99 distinct topics, varying from technical (android, serverfault) to non-technical (photo, cooking).

The success of any web based initiative, be it a Q&A forum or an open source project, depends to a large degree on community participation. This participation could take the form of posting new content, i.e. asking and answering questions, or of curating existing content. To promote participation, Stack Overflow has come up with a rewards mechanism where users earn badges and build a reputation. Non-registered users can view any post and ask questions or give answers. All registered users automatically become part of the reputation award process, where, based on their Q&A creation and curation activities, they collect badges³ and build a reputation over time.

Lotufo et al. [1] have investigated how the game mechanism used in Stack Overflow motivates contributors to increase contribution frequency and quality, by filtering useful contributions and by creating an agile and dependable moderation system. In [2], the authors studied the Stack Overflow technical domains using both quantitative (statistical data analysis) and qualitative (user interviews) approaches and presented evidence of high visibility and active involvement of participants as factors for the success of the Q&A system.

Stack Exchange today boasts 3 million registered users. However, how many of them are contributing to the forums in the essential activity of question and answer creation, versus how many are helping improve the content by voting, commenting, etc.? Studies of open source projects like Apache and Mozilla [3] have shown that 80% of essential activity, i.e. code development, be it features or bug fixing, is done by a small group of core developers. In this paper we investigate the participation of the community in the essential activity of content creation in Stack Exchange forums. Specifically:

• Q1: Is a large amount of questioning and answering done by a small number of people?
• Q2: What is the longevity of users in these forums?
• Q3: Is it possible to earn badges and build reputation without participating in asking questions or answering?
• Q4: What is the user diffusion across different forums of Stack Exchange?

¹ www.stackoverflow.com
² www.stackexchange.com
³ stackoverflow.com/badges

© 2013 IEEE 978-1-4673-2936-1/13/$31.00

II. DATA AND APPROACH

For this challenge, the organizers provided data [4] from 37 Stack Exchange based forums. The data contained forum content and usage information up to July 2012. We analyzed 36 of these forums. Stack Overflow was not analyzed, as it was orders of magnitude larger than the other forums in terms of posts as well as users. For each of the forums, we were provided data on all the posts made in the forum: whether a post was a question or an answer, who made it, and when. For every user, information was provided on when the user registered, the last login, the name and a hashed email id. There was also information on the badges earned per user and his or her reputation. We collated all this information per user per forum. Specifically, for every user in a particular forum, we had the following information:

• Count of questions asked, answers provided, answers accepted and questions not answered.
• Name and email hash.
• Age, calculated in days as the difference between the date of last login and the date of creation in the forum.

A. Summary Statistics of Forums

Across the 36 forums, 1.3 million questions have been asked and around 1.3 million answers provided (some questions might not have been answered, while others might have multiple answers). 0.4 million of these answers have been marked as accepted. There are 0.33 million registered users and 0.7 million user instances⁴.

⁴ A user might have registered for more than one forum. If the email hash of a user is the same across forums, we counted it as one user; however, a user who signed up for 3 forums would be counted as 3 user instances.

Only 951 questions and 482 answers have


MSR 2013, San Francisco, CA, USA

Fig. 1. Log-log scale distribution of the number of activities (Y-axis) performed by users (X-axis).

Fig. 3. User age distribution. Age is calculated as the difference between the last login date and the user registration date.

TABLE II. SUMMARY OF LINEAR REGRESSION MODEL: HOW Q&A ACTIVITY RELATES TO NUMBER OF BADGES EARNED AND REPUTATION SCORE. ALL VALUES SIGNIFICANT WITH P < 0.01.

TABLE I. USER DISTRIBUTION BASED ON NUMBER OF QUESTIONS ASKED AND ANSWERS PROVIDED.

# of activities (Q or A)    # of users posting questions (Q)    # of users posting answers (A)
>=1000                      10 (0.003%)                         93 (0.028%)
500-999                     82 (0.025%)                         193 (0.058%)
100-499                     1818 (0.54%)                        2114 (0.63%)
10-99                       24575 (7.35%)                       14975 (4.48%)
2-9                         52011 (15.5%)                       39214 (11.7%)
1                           73450 (22%)                         68872 (20.6%)
0                           182485 (54.6%)                      208970 (62.5%)
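The binning behind Table I can be sketched as follows. This is a minimal illustration, not the authors' code: the bin edges are copied from the table rows, while the function names and the sample counts are hypothetical.

```python
from collections import Counter

# Bin edges follow the rows of Table I; an upper bound of None means "no limit".
BINS = [
    (1000, None, ">=1000"),
    (500, 999, "500-999"),
    (100, 499, "100-499"),
    (10, 99, "10-99"),
    (2, 9, "2-9"),
    (1, 1, "1"),
    (0, 0, "0"),
]


def bin_label(count):
    """Return the Table I row label for a per-user activity count."""
    for low, high, label in BINS:
        if count >= low and (high is None or count <= high):
            return label
    raise ValueError(f"negative count: {count}")


def distribution(counts):
    """Map each bin label to (number of users, percentage of all users)."""
    tally = Counter(bin_label(c) for c in counts)
    total = len(counts)
    return {label: (tally[label], 100.0 * tally[label] / total)
            for _, _, label in BINS}


# Hypothetical per-user question counts, not taken from the paper's data.
questions_per_user = [0, 0, 0, 1, 1, 3, 12, 150, 0, 2]
dist = distribution(questions_per_user)
```

Applying the same function once to the per-user question counts and once to the per-user answer counts would give the two columns of the table.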

been provided by non-registered users, which is minuscule compared to the overall number of Q&A. Only 3 forums had fewer than 10000 questions asked, while 7 forums had 50000+ questions. Only 5% of questions did not receive even a single answer. In terms of users, only 3 forums had fewer than 5000 registered users; 19 forums had 10K+ users and 5 forums had 50K+ users. New user registrations showed a significant increase over time. From Mar’09 to Jul’10 there were 30 [...] with only 19% of them having contributed to the core activity of question and answers. This preliminary analysis shows that retaining users in forums and enticing them to actively contribute to the core activities needs improvement.

V. Q3: IS IT POSSIBLE TO EARN BADGES AND BUILD REPUTATION WITHOUT PARTICIPATING IN ASKING QUESTIONS OR ANSWERING?

To measure how badges and reputation score relate to Q&A activity, we constructed linear regression models with the number of questions and the number of answers as predictors, fitted separately against the count of badges and against the reputation score (Table II). The sign of each coefficient gives the direction of its correlation with badges and reputation. The adjusted R² lists the fraction of variance explained by each model. Both the number of questions asked and the number of answers provided significantly influence the badge count and the reputation score. The relation is stronger with the reputation score. This is as expected: more Q&A activity means a higher reputation score and badge count.
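A regression of this kind can be sketched in a few lines. The sketch below assumes an ordinary least squares fit with an intercept, which the text does not state explicitly; the function names and the toy data are hypothetical, and the toy target is constructed to be exactly linear so that the fit is easy to check.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(questions, answers, target):
    """Fit target = b0 + b1*questions + b2*answers by least squares.

    Returns the coefficients [b0, b1, b2] and the adjusted R^2."""
    rows = [[1.0, q, a] for q, a in zip(questions, answers)]
    k = 3  # intercept plus two predictors
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    coef = solve(xtx, xty)  # normal equations: (X^T X) b = X^T y
    n = len(target)
    mean_y = sum(target) / n
    predicted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    ss_res = sum((y - p) ** 2 for y, p in zip(target, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    r2 = 1.0 - ss_res / ss_tot
    p = k - 1  # number of predictors, excluding the intercept
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return coef, adj_r2


# Hypothetical toy data: reputation = 10 + 5*questions + 10*answers exactly.
questions = [0, 1, 2, 3, 4, 5]
answers = [1, 0, 3, 2, 5, 4]
reputation = [20, 15, 50, 45, 80, 75]
coef, adj_r2 = fit_ols(questions, answers, reputation)
```

On real data the adjusted R² would fall below 1, and the shortfall is the share of variance in badges or reputation that Q&A counts alone do not account for.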


Fig. 2. Distribution of the percentage of users based on their age in the forum. Users whose last login date equals their registration date have age = 0. Users for whom the difference between their last login date and creation date is below or above 30 days are also shown as two different segments. We considered users whose creation date was before July 2012.
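The age measure used in these figures is a simple date difference. The sketch below is illustrative only: the function names are hypothetical, and placing an age of exactly 30 days in the upper segment is an assumption, since the caption does not say how that boundary is handled.

```python
from datetime import date


def age_in_days(created, last_login):
    """Age of a user instance: last login date minus creation date, in days."""
    return (last_login - created).days


def age_segment(age, threshold=30):
    """Assign an age to one of three segments.

    The 30-day threshold comes from the Fig. 2 caption; treating exactly
    `threshold` days as the upper segment is an assumption."""
    if age == 0:
        return "age = 0"
    if age < threshold:
        return "under 30 days"
    return "30 days or more"


# Hypothetical dates, not taken from the data set.
same_day = age_in_days(date(2012, 5, 1), date(2012, 5, 1))
february = age_in_days(date(2012, 2, 1), date(2012, 3, 1))  # 2012 is a leap year
```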

Fig. 4. For each forum investigated: % of users who have not submitted any question (zero question), any answer (zero answer), any question and answer (zero question and answer) and no badges (zero badges).

Fig. 5. For each forum investigated: % of users who have not contributed a single question or answer and have still earned rewards in the form of badges or reputation.
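The per-forum percentage plotted in Fig. 5 could be computed along the following lines. The record layout (dictionary keys), the function name and the `baseline_reputation` parameter are all hypothetical; in particular, what reputation value counts as "earned" is an assumption that would need to be checked against the data.

```python
from collections import defaultdict


def rewarded_without_qa(user_instances, baseline_reputation=0):
    """For each forum, return the percentage of user instances that posted
    no question and no answer yet hold a badge or reputation above the
    baseline."""
    totals = defaultdict(int)
    rewarded = defaultdict(int)
    for u in user_instances:
        totals[u["forum"]] += 1
        no_qa = u["questions"] == 0 and u["answers"] == 0
        has_reward = u["badges"] > 0 or u["reputation"] > baseline_reputation
        if no_qa and has_reward:
            rewarded[u["forum"]] += 1
    return {forum: 100.0 * rewarded[forum] / totals[forum] for forum in totals}


# Hypothetical user instances, not taken from the paper's data.
users = [
    {"forum": "rpg", "questions": 0, "answers": 0, "badges": 2, "reputation": 1},
    {"forum": "rpg", "questions": 1, "answers": 0, "badges": 0, "reputation": 0},
    {"forum": "rpg", "questions": 0, "answers": 0, "badges": 0, "reputation": 0},
    {"forum": "rpg", "questions": 0, "answers": 3, "badges": 4, "reputation": 120},
    {"forum": "scifi", "questions": 0, "answers": 0, "badges": 0, "reputation": 0},
    {"forum": "scifi", "questions": 0, "answers": 0, "badges": 1, "reputation": 0},
]
shares = rewarded_without_qa(users)
```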

However, the fact that the numbers of questions and answers explain only part of the variance confirms that other curation activities also affect the rewards users collect. Intuitively, given that Stack Exchange is a Q&A forum, the awarding process should most reward the content creation activities of asking questions and providing answers. People doing only curation activities should be fewer in number and should not have a high number of badges or high reputation scores.

Figure 4 plots the percentage of users who did not participate in Q&A along with the percentage of users who have not yet earned any badges. If the data corresponded to our intuition, the percentage of users having zero badges would equal the percentage of users who did not post any question or answer. However, in 32 out of 36 forums, the percentage of users who have contributed zero questions or answers is higher than the percentage with no badges or reputation. As many as 53% of users in forums like rpg, bicycles, judaism and scifi have not asked a single question but still have badges. This motivated us to further investigate these participants with zero Q&A activity (Figure 5). As is evident, we do find a considerable number of people across forums who have earned rewards without contributing to content creation. Additionally, we also found a small number of users who, in spite of having zero Q&A activity, had more than five badges (e.g. 59 such users in programmers) or more than 500 reputation points (e.g. 11 such users in programmers).

In Stack Exchange, people might earn a badge for very simple activities such as completing one’s own profile. Hence, it is not surprising that users with zero Q&A activity have at least one badge. However, what is surprising is that many users participating only in non-core activities have more than one badge and/or a high reputation score (between 100 and 500).

Interestingly, we also found instances of the converse, where users posted questions or answers in a forum but still had zero badges and reputation. Among user instances with zero badges, 85K had posted questions, 74K had posted answers and 24K had posted both. Investigating the root cause of this can be considered for future work.

From the analysed data, it looks like the rewards process can be further fine-tuned to motivate people to participate in


Fig. 6. Percentage distribution of users for each forum. For each forum, Source represents the users who joined this forum as their first forum; Later represents the users who joined Stack Exchange after this forum had come into existence and then joined this forum; and Prior represents the users who were already present on the Stack Exchange platform and joined this forum once it came into existence. The forums are arranged chronologically (from left to right) based on their creation date.

TABLE III. COUNT OF USERS WHO HAVE REGISTERED ACROSS MULTIPLE FORUMS.

Number of forums    Number of users
1                   23375 (69.8%)
2 to 5              76654 (22.9%)
5 to 10             17089 (5%)
10 to 35            7048 (2.3%)
All 36              39
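Counts such as those in Table III follow from grouping user instances by email hash. The sketch below is a hypothetical illustration of that grouping: the input is assumed to be a list of (email hash, forum) pairs, and the names and sample values are not taken from the data set.

```python
from collections import Counter, defaultdict


def forums_per_user(user_instances):
    """Map each email hash to the set of forums it is registered in."""
    forums = defaultdict(set)
    for email_hash, forum in user_instances:
        forums[email_hash].add(forum)
    return forums


def forum_count_histogram(forums):
    """Map 'number of forums joined' to 'number of unique users'."""
    return Counter(len(joined) for joined in forums.values())


# Hypothetical user instances as (email hash, forum) pairs.
instances = [
    ("h1", "superuser"), ("h1", "serverfault"), ("h1", "unix"),
    ("h2", "cooking"),
    ("h3", "photo"), ("h3", "cooking"),
    ("h4", "cooking"),
]
forums = forums_per_user(instances)
histogram = forum_count_histogram(forums)
unique_users = len(forums)          # users, counted once per email hash
user_instances = len(instances)     # one per (user, forum) registration
single_forum_share = 100.0 * histogram[1] / unique_users
```

Grouping the histogram keys into ranges would then give rows like those of the table; the exact range boundaries are left out here because the table's row labels overlap at 5 and 10.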

TABLE IV. FORUMS OVERLAPPING WITH 50% OF THEIR USER BASE.

programmers unix electronics, programmers, english, android, apple cstheory, ux, scifi, rpg photo, cooking, webapps, dba, bicycles, security, webmasters diy, skeptics, stackapps

superuser serverfault, superuser superuser programmers, superuser programmers, serverfault, superuser programmers, meta, serverfault, superuser
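An overlap check of the kind summarized in Table IV can be sketched as follows. It assumes that overlap is measured as the share of a forum's own users who are also registered in the other forum; that choice of denominator, the function name and the sample memberships are assumptions made for illustration.

```python
def overlaps(members, threshold=0.5):
    """Return (forum, other, share) triples where more than `threshold`
    of `forum`'s users are also registered in `other`."""
    result = []
    for forum, users in members.items():
        for other, other_users in members.items():
            if other == forum or not users:
                continue
            share = len(users & other_users) / len(users)
            if share > threshold:
                result.append((forum, other, share))
    return result


# Hypothetical forum memberships keyed by forum name; values are email hashes.
members = {
    "unix": {"a", "b", "c"},
    "superuser": {"a", "b", "d", "e", "f"},
    "cooking": {"g"},
}
pairs = overlaps(members)
```

The measure is asymmetric: in the sample data two thirds of the unix users are also in superuser, but only two fifths of the superuser users are in unix, so only the first pair is reported.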

the core activity of asking questions and providing answers.

VI. Q4: WHAT IS THE USER DIFFUSION ACROSS DIFFERENT FORUMS OF STACK EXCHANGE?

Table III presents the count of users who have registered across forums. 70% of users participate in only one forum, while 22% participate in at most 5 forums. Only very few people (9%) participate in more than 5 forums. Hence, only 30% of users have diffused to other parts of the Stack Exchange network after joining their forum of interest.

Once a forum is created, it attracts three types of users: (1) people who existed in the network before the forum and joined this forum to start contributing, (2) people who joined the network specifically to be part of this forum, and (3) people who joined the network at some later date for some other forum but also ended up joining this forum. Figure 6 presents the distribution of users in each forum in terms of Source (users who joined this forum directly), Later (users who came into existence after the forum's creation) and Prior (users who existed before the forum was created). Prior and Later combined give us an idea of diffusion into each forum from other parts of the network. It is as low as 20% for forums like gis, askubuntu and drupal and as high as 80% for forums like

stackapps, unix, security and skeptics. At least 20% of users in each of the forums joined the network to be part of that topic. Table IV presents forums that share more than 50% of their user base with another forum. Interestingly, 20 forums show overlap with only these three forums: superuser, meta and serverfault (all of which were among the first few forums to be created), and programmers, which was created much later, already shares >50% of its users with superuser.

To summarize, we do see diffusion of users happening across the Stack Exchange network. However, a considerable number of users join a forum directly and join other forums later. The analysis indicates that overall only 30% of users participate across forums, and that users move across forums either when new forums are created or as they get interested in existing forums. Overall, the movement of users across forums does not seem to follow any specific timeline-based pattern; a deeper analysis of this user diffusion can be considered for future work.

VII. CONCLUSION

We analyzed 36 Stack Exchange forums to investigate user activeness in terms of posting questions and answers, longevity in forums and the impact of badges and reputation score. We found that, similar to open source projects like Apache and Mozilla, there exists a small set of core participants who are responsible for the bulk of the core activities. The award processes can be further fine-tuned to incentivize content creation activities. 70% of users have stuck to only one forum in the overall Stack Exchange network. At a forum level, 20 to 80% of users might have diffused from other parts of the network.

REFERENCES

[1] R. Lotufo, L. Passos, and K. Czarnecki, “Towards improving bug tracking systems with game mechanisms,” in Mining Software Repositories (MSR), 2012 9th IEEE Working Conference on, Jun. 2012, pp. 2–11.
[2] L. Mamykina, B. Manoim, M. Mittal, G. Hripcsak, and B. Hartmann, “Design lessons from the fastest Q&A site in the west,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’11. New York, NY, USA: ACM, 2011, pp. 2857–2866.
[3] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and Mozilla,” ACM Trans. Softw. Eng. Methodol., vol. 11, no. 3, pp. 309–346, Jul. 2002.
[4] A. Bacchelli, “Mining challenge 2013: Stack Overflow,” in The 10th Working Conference on Mining Software Repositories, 2013, to appear.
