Preprint manuscript - G Convertino, M Zancanaro, T Piccardi, F Ortega (2017) Toward a mixed-initiative QA system: from studying predictors in Stack Exchange to building a mixed-initiative tool. International Journal of Human-Computer Studies, 99
Toward a Mixed-Initiative QA System: From Studying Predictors in Stack Exchange to Building a Mixed-Initiative Tool

Gregorio Convertino (1,2), Massimo Zancanaro (3,1), Tiziano Piccardi (1,4), Felipe Ortega (5)

(1) Xerox Research Centre Europe; (2) Informatica Corporation; (3) Fondazione Bruno Kessler; (4) University of Trento; (5) University Rey Juan Carlos

Corresponding author: Gregorio Convertino, Informatica Corporation, 2100 Seaport Boulevard, Redwood City, CA 94063, USA
[email protected]
ABSTRACT

This article envisions a new customer support solution that merges the efficiency of crowd-based Question and Answer (QA) sites with the effectiveness of traditional customer care services. QA sites use crowdsourcing to solve problems very efficiently, and they represent a new approach that can compete with traditional customer support services. Despite the remarkable efficiency of popular QA sites, if a question is not solved almost immediately, the chances are that it will not be solved soon, or perhaps ever. This article provides evidence of a consistent Dark Side, a group of questions that remain unsatisfied or are satisfied very late, in eight popular QA sites on Stack Exchange. About 25-30% of all the questions in these sites fall into this Dark Side group. The findings show that predicting whether a question will end up in the Dark Side is feasible, although with some approximation, without relying on content features. On the basis of this evidence, the article first presents and tests a model to predict the Dark Side and then presents a proof-of-concept of a mixed-initiative tool that helps a crowd manager decide whether an incoming question will be solved by the crowd or should be redirected to a dedicated operator. Multiple evaluations of the proposed tool are reported. Finally, the article concludes with lessons for the design and management of future QA platforms.
CONTENTS
1. INTRODUCTION
   1.1 The customer support use case for QA sites
   1.2 The problem with current QA systems
   1.3 Organization of the article
2. ENVISIONED ARCHITECTURE
3. RELATED WORK
   3.1 Precursors of QA sites: Knowledge Management systems
   3.2 Studies of QA sites
   3.3 Automatic predictions for QA tools
   3.4 Mixed-initiative tools and user interfaces for sensemaking
   3.5 Comparison to our own work
4. THE DARK SIDE OF QA SITES: ANALYTICS STUDY
   4.1 Dataset: data from eight Stack Exchange QA sites over two years
   4.2 Sizing up the Dark Side
   4.3 Qualitative analysis of questions in the Dark Side
5. PREDICTING THE DARK SIDE
   5.1 Logistic regression model
   5.2 Testing the logistic regression model
      5.2.1 Quantitative analysis of prediction accuracy
      5.2.2 Qualitative analysis of misclassifications
   5.3 Testing human prediction
6. THE MIXED-INITIATIVE QA SYSTEM
   6.1 From analytics results to design objectives: the mixed-initiative approach
   6.2 The user interface: a dashboard for the crowd manager
   6.3 Experiment with Stack Overflow experienced users
      6.3.1 Hypotheses and method
      6.3.2 Results
   6.4 Mixed-initiative vs. naïve approaches: comparing costs
7. DISCUSSION AND CONCLUSION
   7.1 Lessons for future QA platforms
      7.1.1 A significant portion of questions is likely to remain unsatisfied
      7.1.2 A mixed-initiative tool can indeed help a crowd manager
      7.1.3 Opportunities for improving performance
      7.1.4 Moving beyond a one-model-fits-all approach to prediction
      7.1.5 A promising approach: iterate from Analytics to Design
   7.2 Limitations
   7.3 Summary of contributions
REFERENCES
1. INTRODUCTION

Over the past decade, Question and Answer (QA) sites have become a common information source for Internet users seeking answers. People can quickly find answers to questions within specific domains such as programming, mathematics, or software systems using QA sites such as Stack Overflow (and other Stack Exchange sites), Quora, or Brilliant.org. Some of these sites have been so successful in engaging large crowds of experts that they have become the dominant platforms for worldwide knowledge exchange and problem solving in certain specialized domains (Vasilescu et al., 2014). For example, in August 2016, Stack Overflow received an average of about 8.4 million visits and 7.7 thousand questions per day [1]. However, a key limitation of current QA systems is that a relatively stable portion of the questions is left unanswered or answered very late.

In this context, the article proposes and evaluates the design of a mixed-initiative system. The system supports a new use case in which the work of a crowd of volunteers (either customers or practitioners) is combined with the work of salaried operators. Working together, via the proposed system, they provide a paid service that answers the questions of customers of a technical QA site in a timely manner. Specifically, the paid operators focus on answering the questions that are not answered, or are answered very late [2], by the crowd. We call this set of questions, not answered or answered very late by the crowd, the Dark Side. In section 4 we give evidence of this Dark Side and measure its size in eight QA sites of the Stack Exchange platform.

[1] http://stackexchange.com/sites#traffic (statistics observed on August 31, 2016)
[2] Here "late" is defined as the 5% of questions with the slowest response time (i.e., those falling above the 95th percentile in the distribution of questions by waiting time). See section 4.2.
A first contribution of this article is the evidence of the Dark Side. A second contribution is the proof-of-concept prototype of a mixed-initiative system, designed and tested through an experiment and a simulation study. Finally, although the proposed system has not yet been deployed as a real-world customer service, the work conducted allowed us to draw a few initial lessons, which are reported at the end of the article as guidance for future research.
1.1 The customer support use case for QA sites

QA sites represent a successful case of online peer production, which is causing new use cases to emerge in business organizations. As QA sites became a primary reference for information seekers on the public Web, private companies have started exploring solutions that leverage crowds of experts to power services like support to customers on specific products. As a result, some companies have invested in addressing this new emerging use case: using QA communities to streamline customer support services or make them more efficient. For example, Adobe reduced its support costs on products such as Adobe TV by using online user forums where customers can ask questions and get answers from other customers who are delighted to demonstrate their expertise with the product (Howard 2009).
1.2 The problem with current QA systems

The problem with current QA systems is that while some sites exhibit an astounding efficiency (e.g., Mamykina et al. (2011) reported an answer rate above 90% and a median answer time of only 11 minutes in Stack Overflow), they also have some consistent deficiencies that, to date, prevent them from becoming a reliable alternative
to traditional customer support services in industry (where paid operators are employed to answer questions). Our investigation of eight Stack Exchange sites revealed two fundamental limitations. First, we found that a significant proportion of the questions already submitted are either left unanswered or answered very late -- the Dark Side. For example, for Stack Overflow we found that almost a third of the questions submitted (29%) are not answered or are answered relatively late (see below for details). This proportion was significant across all eight sites, ranging from 19% to 34% of their questions. Second, when a new question arrives, the current sites do not allow predicting whether it is likely to get a satisfactory answer, or estimating the waiting time to resolve it. Thus, users cannot form clear expectations, and the answering community cannot preemptively avoid long delays. Such a high level of uncertainty would be unacceptable for a paid service that answers customers' questions.
1.3 Organization of the article

In section 2, we present our envisioned overall architecture for a multi-tier asynchronous helpdesk that exploits both the reactivity of the crowd and the high-quality answers of paid operators. The two main components, the Automatic Predictor and the Mixed-Initiative user interface, are the focus of this paper. In section 3, we summarize four areas of related work.
In section 4, we provide analytical evidence of the existence of the Dark Side. We also present a qualitative analysis of possible reasons why a question might end up in the Dark Side. In section 5, we present and test an implementation of the automatic prediction component. In section 6, we present the mixed-initiative system prototype and two evaluation studies. The first study is an experiment with experienced users of Stack Overflow who used our proof-of-concept prototype, under the assumption that they may well represent the typical crowd manager for our envisioned system: a person with some technical skills, a reasonable understanding of the questions, and some practical knowledge of how the crowd may feel about each question. To complement the experiment and attenuate the limitations of not having a real deployment, we also ran a simulation study to show the expected benefits in a realistic setting. Finally, in Section 7 we conclude by presenting the main lessons learned by working on the system prototype, the potential benefits of its deployment in real settings, as well as limitations and possible threats to the validity of the results presented in this research.
2. ENVISIONED ARCHITECTURE

The envisioned system architecture is a multi-tier asynchronous helpdesk that exploits both the reactivity of the crowd and the high-quality answers of paid operators when these are needed (see Figure 1). The first tier relies on a QA site for a crowd of volunteers (e.g., the community of customers).
Once the helpdesk user posts a question, it becomes immediately visible to the crowd or community, and everyone in this first tier can post an answer. As in current QA systems, the volunteers are motivated through mechanisms that publicly reward useful contributions (e.g., comments to the question) and good answers (e.g., Vasilescu 2014). Namely, participants can accrue positive (or negative) reputation points that are publicly visible on their profiles. In our system architecture, to ensure high-quality service, the unanswered questions are monitored by human operators (crowd managers), who use a mixed-initiative tool to decide whether to assign these questions to paid operators or leave them to the crowd. The mixed-initiative tool (yellow box on the right of Figure 1) includes:

• A prediction component that calculates, for each question submitted to the crowd tier, the probability that it will not be answered or will be answered too late (the definition of "too late" was provided in the introduction), and

• A Graphical User Interface (GUI) that helps the crowd manager decide whether to leave an unanswered question to the crowd (crowd tier) or to redirect it to the paid operators (operators tier) (see Figure 1).
The escalation of a question happens in two different cases:

1) When the crowd manager identifies problematic questions (supported by the prediction component) (the 'expert' case in Figure 1).

2) When questions exceed the maximum waiting time set (the 'timer' case in Figure 1).
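For concreteness, the escalation rule can be summarized in a few lines of Python. This is only an illustrative sketch of the logic described above, not the implementation used in our prototype; the names (should_escalate, DARK_SIDE_THRESHOLD, MAX_WAIT) and the fixed threshold are assumptions made for the example.

```python
from datetime import timedelta

# Illustrative values; the actual threshold and time limit are service-level choices.
DARK_SIDE_THRESHOLD = 0.5          # probability above which a question looks "at risk"
MAX_WAIT = timedelta(hours=24)     # hard time limit guaranteed to the customer

def should_escalate(question, now, crowd_manager_decision=None):
    """Return True if the question should be routed to a paid operator.

    Two cases trigger escalation:
    1) the 'expert' case: the crowd manager, aided by the predicted probability,
       flags the question as problematic;
    2) the 'timer' case: the question has waited longer than the maximum time set.
    """
    # Timer case: escalate unconditionally once the time limit is exceeded.
    if now - question.submitted_at > MAX_WAIT:
        return True

    # Expert case: the prediction component only supports the decision;
    # the crowd manager has the final word when available.
    if crowd_manager_decision is not None:
        return crowd_manager_decision
    return question.predicted_dark_side_probability > DARK_SIDE_THRESHOLD
```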
Figure 1 shows the multi-tier architecture of the overall QA system and how the mixed-initiative tool, in tandem with the crowd manager, connects the crowd tier and the operators tier in order to provide timely answers to users' questions.
Figure 1. The multi-tier architecture of the QA system.

In this article we focus on the mixed-initiative tool for the crowd manager as the central component of a broader service. However, for the overall design to be applicable we made three assumptions about the crowd, the operators, and the service:

• The type of questions answered by the crowd and the timing of resolution are stable enough to be predictable.

• If the crowd cannot answer a question, then the crowd manager always has access to a dedicated paid operator with the knowledge to provide an answer.

• The service must guarantee that any question from a customer gets an acceptable answer within a given time limit.
This article does not discuss or evaluate the soundness of these assumptions; rather, it aims to demonstrate the feasibility of the mixed-initiative tool by proposing a possible implementation of the automatic prediction model and a mixed-initiative graphical interface.
3. RELATED WORK

3.1 Precursors of QA sites: Knowledge Management systems

Current research on QA sites can be greatly informed by prior work on pre-Web 2.0 tools and knowledge management systems adopted in organizations, such as forums or knowledge-sharing tools for experts. One of the earliest examples of community-based self-help technologies is the Eureka system at Xerox (Bobrow and Whalen, 2002), studied by Julian Orr. It allowed a community of repair technicians across the company to submit problems and propose solutions (Orr 1996). Another classic example is Ackerman's Answer Garden system: a platform that lets users pose and answer questions. Answer Garden allowed users to find answers to common questions and provided access to experts in the organization (Ackerman 2000). Closely related to our hybrid design (see section 2) is the multi-tier helpdesk service studied by Halverson and collaborators (2004) at IBM, where the answering service fed into a database of common problems. Their study focused on the creation of content for a FAQ (Frequently Asked Questions) repository published on the intranet for helpdesk clients.
3.2 Studies of QA sites

A few studies have illustrated the high efficiency of popular crowd-based QA sites. Mamykina and collaborators (2011), in their analysis of Stack Overflow, found that 50% of the questions up to 2011 had received their first answer within 11 minutes. They also investigated the factors determining this high level of efficiency. However, while they acknowledged that there was a long tail of unanswered questions, this aspect remained secondary in their study. In a report on Aardvark, which routes questions to users known to be online, Horowitz and Kamvar (2010) reported that 57.2% of the questions received their first answer within 10 minutes. On the other hand, they reported that 15.5% of the feedback from askers rated the answers as "bad", but this aspect of the system was not investigated further.

Studies concentrating on questions that remain unanswered are less common and relatively shallow. Treude et al. (2011) conducted a qualitative analysis of Stack Overflow, categorizing the types of questions answered and those that remain unanswered. They categorized 385 questions (from a limited period of 2 weeks in 2010) and aimed at descriptive findings rather than prediction. More generally, we are not aware of any in-depth quantitative analysis that characterized and predicted unanswered questions in order to address them. Inspecting this aspect in more detail would inform how crowdsourced QA sites could be used in customer care applications. Saha et al. (2013) conducted a quantitative study on answered and unanswered questions in Stack Overflow (using 2008-2012 data). Their preliminary results did not show an influence of structural attributes such as the number of tags, length of questions, or presence of code and external links, but observed some implicit indicators of the
community interest, such as the number of views, score, number of favorites, and number of comments. Unfortunately, their results are not applicable to our study because our prediction models need to rely on events that are available early in the process: i.e., the number of views or favorites is not available early on. Recognizing that the crowd can be leveraged in QA services, several IT companies have started to exploit the efficiency of crowdsourcing-based QA sites to power new kinds of customer care services. In some cases they used paid crowdworkers (e.g., Crowd Engineering, 2014), and in others self-help crowds of volunteer customers (e.g., Lithium, 2014). For example, Adobe TV used online user forums to allow customers to ask questions and get answers from other customers (Howard 2009). These early crowd-based solutions address a use case similar to the one targeted in this work (section 1.1).
3.3 Automatic predictions for QA tools

Several researchers have proposed prediction or classification models applied to the QA domain to solve focused problems, but work that integrates these contributions back into working QA services is still limited. Zhou, Lyu, and King (2012) proposed a classification model based on Support Vector Machines (SVM) that, given a question-user pair, computes whether the user can solve the problem. They identified twenty features that allow their model to reach promising predictive accuracy, and we have adopted several of their features in our approach, too. A study by Burel and He (2013) showed that, in Server Fault, question complexity, defined as the level of expertise required to answer a question, is highly correlated with the topic on which the question focuses, along with the asker's involvement in the community.
Although the goal of that study was to provide a tool for managers to understand the maturity of an online community, which differs from our use case, in our prediction model we draw on the same insight by considering the asker's history of contributions. Deeper analyses of expertise estimation were proposed by Zhang, Ackerman, and Adamic (2007) for the Java Forum and by Li and King (2010) and Liu, Song, and Lin (2011) for Yahoo! Answers. Zhang, Ackerman, and Adamic (2007) proposed the z-score measure, which combines asking and replying patterns to characterize the experience gained by participants. In our approach, we use their z-score metric, among others. Li and King (2010) and Liu, Song, and Lin (2011) suggested ways to compute the relative ranking of answerers to determine who is most suitable to solve incoming questions. Zhou, Lyu, and King (2012) proposed three approaches to profile the expertise of active users and support the routing of a question. Their solutions identify the best candidate based on their effectiveness on previous answers and a language model of the questions. Chang and Pal (2013) used a similar approach while stressing the importance of community-related aspects. Their approach tried to recognize the most promising small set of participants who might be willing to answer a given question by considering some commonality of traits with the person who posed the question. In our approach, we also consider the characteristics of the person who posed the question, but from a different stance. Yet, these works may suggest promising solutions that could be exploited in future versions of our tool to enable a fine-grained assignment of questions to operators. A closely related effort in predicting the outcome of Stack Overflow questions was made by Asaduzzaman and collaborators (2013), who used features of questions to build a
model for predicting unanswered questions. We use some of the same features; yet their prediction models excluded questions that did not receive any answer (which we included -- about 8.5% of our dataset), and the results from their preliminary investigation appear less promising than those of our approach. Yang et al. (2011) also built a model to predict unanswered questions in Yahoo! Answers. This site is different from any Stack Exchange site in various ways: e.g., it is a general-purpose QA site adopting different policies, chiefly the policy of a 4-day limit to answer a question. Hence, differences in scope and policies lead to different user behaviors, and their results cannot be compared with ours. All the same, some of the features we consider are similar to the ones used in their work.

Correa and Sureka (2013) studied the reasons why a question could be marked as 'closed' in Stack Overflow. A question is marked as closed when a moderator considers it not suitable for the community: off-topic, duplicate, too localized, not a question, or subjective. They found that about 3% of the questions fall in this category. They built different prediction models, obtaining relatively accurate results (accuracy of about 70%). In a second study, Correa and Sureka (2014) observed that about 8% of the questions are so poor in quality that the community votes for their deletion. As in their previous work, they compared different predictive models and selected the one with the best accuracy (almost 66%). Our research focuses on a different problem in relation to these efforts, because we focus on the corpus of questions left for the crowd to address after users or moderators have filtered out unsuitable ones: that is, in our analysis we excluded deleted and closed questions.
Working on the other end of the spectrum of question quality, Anderson and colleagues (2012) focused on high-quality questions and answers in Stack Overflow (2008-2010). They proposed the interesting new approach of analyzing questions together with their set of corresponding answers. As in our research, they exploited features such as the reputation of the questioner and the speed of the actions around the question. Their results represent a modest improvement over random guessing in predicting the interest of the community in a question (e.g., page views). While interesting as an approach to the automatic prediction of perceived quality, this work remains preliminary in nature.
3.4 Mixed-initiative tools and user interfaces for sensemaking

A key difference between our work and the solutions reported above is that those approaches do not integrate the human in the loop for deliberating on questions. Mixed-initiative user interfaces are those that couple automated services with direct manipulation by users in order to benefit from both machine and human intelligence (Horvitz, 1999). In this work, we follow the human-in-the-loop perspective, where the computer generates possible solutions and the role of the human is to select the best one to use (e.g., Scott, Lesh, and Klau, 2002). That is, while the tool visualizes the probabilities of resolution for each question, the user validates such predictions and makes decisions. Poulin et al. (2008) proposed a framework to visualize the decisions of machine learning classifiers, and the related evidence for those decisions, in the context of a web-based bioinformatics tool. Although our approach is much simpler, it is framed in a similar way. We also validate our approach through a user study. Other mixed-initiative approaches
that use verbal explanation (e.g., Kulesza et al., 2011) are interesting but unsuitable for supporting our crowd managers, due to the time pressure involved in their task. Antifakos, Schwaninger, and Schiele (2004) explored how to visualize uncertainty explicitly in a lab-based memory task. By displaying uncertainty information, they found improvements in performance, in terms of hit rates, and no effect on alarm rates. Our results support and extend these findings by relating such improvements to measures of users' workload and performance level. Rukzio et al. (2006) discussed how, in a form-filling task, visualizing the estimated accuracy of an automatic prediction does not help users but rather tends to confuse them. Our results partially contradict these findings, and we discuss possible reasons in the last section. De Liddo and collaborators (2012) presented a mixed-initiative tool that supports collective sensemaking in a distributed community of analysts by combining human and machine annotation of online documents.

Several user interfaces have been proposed to support analysts in performing sensemaking tasks under uncertainty (Pirolli and Card, 2005). For example, several visualization tools were built to help intelligence analysts assess the uncertainty of the evidence that supports a hypothesis. The I-scapes tool (2008) visualized a hypothesis as a 2D Cartesian space with each piece of evidence displayed as a point in this space, where the x-axis and y-axis indicate, respectively, the degree of support of the piece of evidence for the hypothesis and the credibility of its sources. Similarly to our design, the visualization incorporated colors (from red to green background) as an additional visual
coding of the level of support for the hypothesis. Likewise, the ACH tool by Pirolli et al. (2005) used color to represent global uncertainty around each hypothesis.
3.5 Comparison to our own work

While this article includes material previously published in Piccardi et al. (2014) (in particular, the study of Stack Overflow data and the user interface of the mixed-initiative tool), it mainly consists of original content. It includes new data from seven additional Stack Exchange sites that provide further evidence of the Dark Side; a small qualitative analysis of unanswered questions (previously distributed as an unpublished report); the overall system architecture; additional evaluation studies of the system; and lessons that inform future research.
4. THE DARK SIDE OF QA SITES: ANALYTICS STUDY

In this section, we characterize the Dark Side across eight QA sites of the Stack Exchange platform (2014), with the aim of showing that this problem is pervasive. We define the "Dark Side" as the set of questions that either did not get any answer, did not get an answer considered satisfactory by the user who posted the question, or got an answer very late (i.e., the 5% of questions that took the longest time to get a satisfactory answer).
4.1 Dataset: data from eight Stack Exchange QA sites over two years

To build a corpus for our analysis, we selected eight popular QA communities or sites running on the Stack Exchange platform (2014). The QA sites considered run on the
same platform and thus share the same set of functions, user interface, and mechanisms for motivating users and tracking their reputation. This makes the collected data more easily comparable among the different QA sites. Stack Exchange QA sites consistently use tags to categorize questions. Although not an original function for the collaborative classification of similar questions (Halpin, Robu, and Shepherd, 2007), the use of tags here is more structured than in traditional tagging systems because there are explicit mechanisms that favor convergence on existing keywords when tagging questions. Tagging may then be used to define sub-communities of experts around a topic (Mamykina et al., 2011).

Starting from a Stack Exchange data dump dated March 2013, we selected the sites with the largest amount of data (i.e., the largest numbers of questions submitted and participants, and more than 2 years of activity). The result was a list of 36 sites (excluding the "Meta" websites). Then, we excluded the sites or communities that were not old enough to be considered stable QA communities (i.e., those that in August 2010 had not yet completed their initial growth and did not have a stable question rate). Hence, we were left with the top ten sites, each with about 20,000 questions or more over a period of at least 2 years. Indeed, our goal was to analyze questions from a 2-year period for each site. From those ten sites, we excluded the site English Language and Usage because it is not strictly focused on specialized technical knowledge. We also excluded the site Programmers because it overlapped with Stack Overflow in terms of domain and tags. Hence, the data corpus includes these eight QA sites:
1. Apple (Ask Different): Apple users and developers;
2. Ask Ubuntu: Ubuntu users and developers;
3. Math (Mathematics): math students, instructors, and professionals;
4. Server Fault: professional system and network administrators;
5. SharePoint: SharePoint users and administrators;
6. Stack Overflow: professional and enthusiast programmers;
7. Super User: computer enthusiasts and power users;
8. Wordpress (Wordpress Answers): Wordpress administrators and developers.

The selected QA sites differ along the following dimensions:

• Domain: product-centered (Ask Ubuntu, Apple, SharePoint), skill-centered (Math, Stack Overflow), mixed (Super User, Wordpress).

• Domain type: technical (Ask Ubuntu, Apple, SharePoint, Stack Overflow, Super User, Wordpress), non-technical (Math).

• Community age: from the oldest, Stack Overflow, launched in 2008, to Super User and Server Fault, released in 2009, to the younger Math, started in 2010.

• Community size: from the largest, Stack Overflow, attracting two million users, to the mid-size Server Fault, Super User, and Ask Ubuntu in the range of thousands of participants, to the smaller Apple, Ubuntu, and Wordpress with about one thousand, and down to SharePoint with fewer than 500 users answering questions.

• Community growth rate: Super User and Server Fault had limited growth (around 40% and 20%, respectively) in the last 2-year period; Stack Overflow doubled its user base, while the other sites experienced faster growth rates.
The dataset used includes data from August 1st, 2010 to August 1st, 2012 for all questions (answered or not in the period), their answers, and relevant metadata such as the tags used on a question or the votes cast on an answer. Table 1 summarizes the characteristics of the eight sites (the community size is approximated by the number of registered users that had replied to at least one question over this 2-year period).

Table 1. Characteristics of the eight QA sites: counts from a 2-year period

Site             Launched   Community   #Questions   #Answers
Apple            Sep-08     11,478      17,211       31,261
Ask Ubuntu       Jan-09     33,149      46,886       76,557
Math             Mar-10     18,963      59,821       96,924
Server Fault     Aug-08     44,832      72,727       127,445
SharePoint       Sep-08     4,512       13,402       20,219
Stack Overflow   Jul-08     698,989     2,500,390    4,553,118
Super User       Aug-08     56,446      80,451       134,487
Wordpress        Jan-09     7,798       18,238       28,621
Closed questions are not included in this dataset (for a study on closed questions on Stack Overflow see Correa and Sureka, 2013 and Lezina and Kuznetsov, 2013). Closed questions are those marked by the community as off-topic and removed from the site by moderators. Across the eight sites, the proportion of closed questions ranges between 3.5% and 4.5% of all submitted questions.
4.2 Sizing up the Dark Side

There might be several reasons why a question is not answered at all or is not answered quickly enough, thus ending up in the Dark Side. In this section, we are not interested in the ultimate reasons why this happens (briefly discussed in section 4.3); rather, we are interested in investigating and measuring the Dark Side. In order to make the definition of "not answered or answered too late" concrete, we adopted the three following criteria for inclusion in the Dark Side (Figure 2, top):

1. Questions that did not get any response (satisfactory or not) during the 2-year period considered and up to two additional months afterwards. These questions remained unanswered within this interval even though, for some, there might have been some activity: e.g., comments were posted. As shown in Table 2, the proportion of this group of questions relative to the total ranges from slightly less than 5% in Wordpress to around 15% in Ask Ubuntu and Super User.

2. Questions that did get one or more answers, but none of them was satisfactory. That is, no answer was marked as accepted or received at least one up vote. In the Stack Exchange platform, the user who posts the question is the only one who can mark an answer as "accepted", meaning that it solves the problem. Yet, a number of users do not bother to formally mark one of the answers to their question as accepted. In these cases, the acceptance remains dangling. In other cases, multiple answers are valid, such that multiple answers could have been accepted. Unfortunately, this is not an option in the Stack Exchange platform, so it is not always easy to separate questions without an accepted answer from those with answers that are simply left unmarked. This issue was extensively debated by the Stack Exchange management and user community (see the Stack Exchange blog post 'OK, Now Define "Answered"', 2014). The established practice is to consider questions that have at least one answer with at least one up vote (and no accepted answer) as resolved. Conforming to this community convention, we adopted this relatively conservative rule in our analysis. As shown in Table 2, the proportion of questions that did not get a satisfactory answer based on this definition ranges from a low 1.37% in Math to over 13% in Stack Overflow and Super User.

3. Finally, the 5% of questions that were the slowest in receiving an answer. That is, they were answered later than the remaining 95% of the questions. While this is an arbitrary definition of "late", it has the advantage of being relative to the speed of each community and allows comparisons across communities. Table 2 (third column) and Figure 2 (bottom) show the 95th percentile of the distribution of waiting time to get a satisfactory answer. This ranges from 2.4 days for Math, to 10 days for Stack Overflow, to 7 weeks for Apple (Ask Different). Since these answers clearly take too long, we classify this last 5% of questions as part of the Dark Side. It is worth noting, however, that this is a conservative estimate: in a real-life customer care application with hard time constraints, the percentage of questions considered "too late" would probably be larger than 5%.

In total, as shown in both Figure 2 (top chart) and Table 2 (the "Dark Side (total %)" column), we find that the Dark Side represents a considerable portion of all questions in
the eight QA sites that we examined. It ranges from 19% to 34% of the total number of questions (see Figure 2), after excluding “closed” questions.
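The three criteria above can be expressed as a simple rule over the question records in the data dump. The sketch below is an illustrative Python rendering, not the code used in the study; the field names (answers, is_accepted, up_votes, created_at) and the per-site percentile argument are assumptions made for the example.

```python
def is_dark_side(question, site_95th_percentile_wait):
    """Classify a question as Dark Side per the three criteria of section 4.2.

    Criterion 1: no answer at all within the observation window.
    Criterion 2: answers exist, but none is accepted and none has an up vote.
    Criterion 3: a satisfactory answer arrived, but later than the site's
                 95th percentile of waiting time.
    """
    if not question.answers:                                  # criterion 1
        return True

    satisfactory = [a for a in question.answers
                    if a.is_accepted or a.up_votes >= 1]
    if not satisfactory:                                      # criterion 2
        return True

    first_good = min(a.created_at for a in satisfactory)
    waiting_time = first_good - question.created_at
    return waiting_time > site_95th_percentile_wait           # criterion 3
```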
Figure 2. (Top chart) Proportion and composition of the Dark Side versus the Bright Side in eight QA sites. (Bottom chart) The 95th percentile of the distribution of waiting time, in number of days, for each of the eight QA sites. The bars are ordered based on the top chart, i.e., from the smallest to the largest Dark Side. Average: 27 days.
Table 2. The Dark Side in the eight QA sites: questions with no answer, questions with no satisfying answer, and the 95th percentile of waiting time in days (the 5% of questions with answers arriving after this time were labeled as "too late" for that site and thus included in the Dark Side). Data from August 1st, 2010 to August 1st, 2012.

Site                No answer   Not satisfying   95th percentile (waiting days)   Dark Side (total %)
Mathematics         12%         1%               2                                19%
Wordpress           5%          10%              17                               20%
Apple               10%         7%               49                               22%
SharePoint          14%         10%              38                               29%
Server Fault        10%         14%              18                               29%
Stack Overflow      10%         14%              10                               29%
Ask Ubuntu          15%         10%              44                               30%
Super User          15%         14%              39                               34%
Average (8 sites)   11%         10%              27                               27%
4.3 Qualitative analysis of questions in the Dark Side

A small content analysis was conducted on a random sample of 40 questions with the purpose of characterizing the questions doomed to the Dark Side (the same work was then extended in order to check and understand the misclassifications of our automatic classifier; those results are described in section 5.2.2). We randomly extracted 20 examples of Stack Overflow questions that truly ended up in the Dark Side and 20 that did not [3]. Based on Stack Exchange guidelines (Stack Exchange Meta, 2015), we defined a simple coding scheme to classify each question along five dichotomous dimensions:

• Clear - Unclear: whether the question was clearly written, both in terms of content (details provided, examples given, etc.) and in terms of form (plain and correct English). According to the Stack Exchange guidelines, clear questions are more likely to receive answers than those that are unclear or written in poor English.

• General - Unique: whether the question was of general interest vs. about a unique problem. According to the Stack Exchange guidelines, questions that may be of general interest for the community are more likely to receive answers than those about unique issues.

• Bounded - Unbounded: whether the question is bounded to a single precise issue rather than a generic issue or problem. According to the Stack Exchange guidelines, questions that are bounded to a precise issue are more likely to receive answers than generic ones.

• On Topic - Off-Topic: whether the question is on topic with respect to the tag selected. According to the Stack Exchange guidelines, the choice of the correct tags is one of the most important aspects to consider in posing a question to the community (see section 5.1 for more details).

• Within Tutorial - Beyond Tutorial: whether the answer might be found in a good tutorial or not. According to the Stack Exchange guidelines, questions that can be answered by simply reading a good tutorial should not be posed to the community.

[3] We extracted the 40 questions from a subset of the corpus described in section 4.2. In particular, these were part of the Stack Overflow questions initially used to test our logistic regression classifier, to evaluate the misclassifications qualitatively (as explained in section 5.2.2 below). Here the same sample is also used to characterize Dark Side questions. Although unconventional, we believe that the sampling method used did not introduce a systematic bias and, thus, it can be considered random.
Two annotators coded the questions independently. They categorized each question based on the dimensions that appeared relevant to them. The agreement between the
coders was high for the On Topic dimension (k=.95), moderate for the dimensions Bounded, Clear, and General (k=.60, .40, and .40, respectively), and low for the Beyond Tutorial dimension (k=.25). We speculate that the low agreement on the last dimension is due to potential confusion between the Within/Beyond Tutorial dimension and the Bounded/Unbounded or General/Unique dimensions. That is, Unbounded or General questions may be answered by pointing to a good tutorial. After measuring agreement, the two coders discussed the results and reconciled their differences. That is, for each case of disagreement, the two coders agreed on a common categorization of the question.
Figure 3. The distribution (with counts shown on the bars) across the five bi-polar dimensions (Unclear, General, Unbounded, Off-Topic, With-Tutorial) for the two sets of questions coded: 20 in the Dark Side and 20 not in the Dark Side. The red and green colors represent the two opposite poles of each dimension used to code each question.
All the questions received at least one category, with a median of 1 and a mean of 1.55 categories (st. dev. = 0.96). Figure 3 shows the counts of the categories. The Dark Side seems to include primarily questions that are unclear or too specific and, in some cases, those that lack a specific issue (unbounded questions). We did not find many off-topic questions. This is probably because they are usually quickly marked as closed by users or moderators and, as we mentioned in section 4.1, closed questions had been excluded from our corpus. Finally, the questions that could be solved simply by referencing a tutorial seem quite frequent, even though they are discouraged by the Stack Exchange guidelines. However, while frequent, they do not appear to be characteristic of Dark Side questions, since users are equally likely to solve them or to let them slip into the Dark Side (though, it should be noted, this category did not reach a high level of inter-coder agreement).
5. PREDICTING THE DARK SIDE

A key component of the architecture envisaged in section 2 is the module that predicts whether a question is doomed to the Dark Side. In this section, we present a first approach to estimating the probability that a given question belongs to the Dark Side by using logistic regression. This prediction model is then evaluated using standard quantitative metrics and a qualitative comparison with human performance.
5.1 Logistic regression model

As a first step toward an automatic tool that predicts whether a question is doomed to the Dark Side, we are interested in a general-purpose predictor based solely on easy-to-compute features. In particular, we decided not to use information about the content of the
questions, although recent research has shown that discretized linguistic features can provide valuable information for detecting good answers (Gkotsis et al. 2014). Instead, we prefer to employ a simple predictive model that provides a consistent baseline against which to measure improvements brought by more complex models (for instance, models that use linguistic features need to be adapted to the specific language of each QA site). Hence, we focus here on relatively simple features that do not require knowledge about the domain of expertise or the content of questions. Specifically, we consider features capturing (i) structural characteristics of the question, (ii) characteristics of the questioner, with respect to his/her previous activities in the community, and (iii) the sub-community to which the question is addressed, where the sub-community is identified by tags and the users who monitor them.

5.1.1 Features

These are the features considered in our analysis and the rationale for their inclusion.

• bodyLength and titleLength: the number of characters in the body and the title of the question, respectively. The motivating assumption is that longer questions and titles may indicate tedious questions (Mamykina et al., 2011) or that the author wrote the question more carefully (Treude et al., 2011).

• hasExample and ratioExample: respectively, whether the question contains an example, and the ratio of the length of such an example (e.g., code) with respect to the whole question. The rationale is that problems that cannot be easily reproduced with a small and self-contained example are more difficult to answer to a satisfactory extent (Mamykina et al., 2011; Treude et al., 2011). To detect examples, we leveraged conventional HTML tags: the sites of the Stack Exchange network let users include a snippet of source code that will be formatted with the correct highlighting, using the HTML tag 'code'. In the Math site, we additionally considered the variant used by this community for describing a formula in line with the question text. Across the eight sites, we measured the length of the example in a comparable way by using a relative measure: i.e., we computed the ratio of the length of the text in the example over the full text of the question, after removing white spaces and new lines from both text portions.

• nTags: the number of tags used in the question (a maximum of 5 is allowed by Stack Exchange). When more tags are used, we assume that the question will be visible to a wider audience. Thus, such a question is more likely to be answered, as it is more likely to be noticed by relevant experts.

• maxMedian and minMedian: respectively, the maximum and minimum among the median response times to questions marked with the tags used in the question. In our dataset, all questions had at least two tags. The rationale for including these features is to account for the responsiveness of the community (Mamykina et al., 2011; Harper et al., 2008).

• maxPeople, minPeople, and sumPeople: respectively, the largest, smallest, and sum of the sizes of the tag-based communities of answerers. That is, each tag of the question has its community of subscribed answerers. Prior analyses point to the benefits in efficiency of having a larger and active QA community compared to a smaller community (Harper et al., 2008).

• prevQuestions: the number of previous questions posed by the questioner. The rationale is that users may become more skilled in formulating questions with experience. Prior work suggests that accounting for past user activity helps in making more accurate predictions (Liu and Agichtein, 2008).

• zScore: the normalized ratio of the difference between questions and answers posted by each user (Zhang et al., 2007). The rationale is that answering a greater proportion of questions (relative to one's own activity) implies higher expertise. Zhang and colleagues propose this z-score as a proxy measure of expertise.

• firstAnswerSpeed and firstActionSpeed: respectively, features based on the time of the first answer and the time of the first action, where the action can be an answer, a comment to the question, or an edit to the question. The rationale is to exploit information from early events in the QA process, useful for prediction purposes, in addition to the properties of the question or the questioner. This feature is computed as follows:

$$
speed(t) =
\begin{cases}
1 - \dfrac{t}{t_h}, & \text{if } t < t_h \\
0, & \text{if } t \geq t_h
\end{cases}
$$

where $t_h$ is the median time of the first answer (or of the first action) and $t$ is the time of the first answer (or action) relative to the posting timestamp of the question. The relative speed gets closer to 1 as $t$ becomes smaller, that is, when the answer (or the action) was fast. Earlier research by Hanrahan, Convertino, and Nelson (2012) on Stack Overflow data showed that the speed of arrival of answers is a strong predictor of questions that fall in the Dark Side.

• weekend: a categorical variable indicating whether the question was posed on a weekend or on a weekday. The rationale is that there may be different levels of activity between weekdays and weekends that might impact the likelihood of getting an answer (Treude et al., 2011).
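To make the feature definitions more concrete, the sketch below computes three of them (the example ratio, the z-score, and the relative speed) for a single question. It is an illustrative Python rendering under our reading of the definitions above, not the extraction code used in the study; the function names and the assumption that examples sit inside HTML 'code' tags are made for the example.

```python
import math
import re

def ratio_example(body_html: str) -> float:
    """Length of text inside <code> blocks over the full question text,
    with white space and new lines removed from both (section 5.1.1)."""
    code_parts = re.findall(r"<code>(.*?)</code>", body_html, flags=re.S)
    collapse = lambda s: re.sub(r"\s+", "", s)
    full_text = collapse(re.sub(r"<[^>]+>", "", body_html))
    code_text = collapse("".join(code_parts))
    return len(code_text) / len(full_text) if full_text else 0.0

def z_score(n_answers: int, n_questions: int) -> float:
    """Zhang et al. (2007): normalized difference between the answers and the
    questions posted by a user; higher values suggest higher expertise."""
    total = n_answers + n_questions
    return (n_answers - n_questions) / math.sqrt(total) if total else 0.0

def speed(t: float, t_h: float) -> float:
    """Relative speed of the first answer/action: 1 - t/t_h if t < t_h, else 0."""
    return 1.0 - t / t_h if t < t_h else 0.0
```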
5.1.2 Classification

For the sake of simplicity, we want to predict whether a given question will end up in the Dark Side, that is, whether it will not receive a quick and accurate answer. This is a binary classification problem, and several approaches are available in the machine learning literature to address it (e.g., Mitchell, 1997). As a classifier, we chose logistic regression (Agresti, 2002) because fast algorithms are available to train models and because it provides a useful probabilistic interpretation of its results. Furthermore, with respect to other probabilistic classifiers, such as decision trees, it is less sensitive to violations of independence among the features (Mitchell, 1997). A disadvantage of this approach is its weakness on unbalanced datasets (which it shares with several other approaches). To compensate for this limitation, and for practical reasons of time and computational resources, we use 10 random datasets balanced between Bright Side and Dark Side classes, sampled from each site. The sizes of these random training datasets for each site are: Apple: 5,257; Ask Ubuntu: 16,824; Math: 15,742; Server Fault: 22,503; SharePoint: 6,402;
Stack Overflow: 22,379; Super User: 22,349; Wordpress: 3,418; All sites: 114,884. The values of the continuous features are normalized across the eight sites (dividing by the respective maximum value of each feature found among all sites), in order to compare the effects. Since our purpose is to provide a reasonably accurate prediction as soon as possible after the submission of each question, we train our models at three subsequent times:

• at submission time (i.e., using all features described above except the two time-based ones: firstActionSpeed and firstAnswerSpeed),

• at the median time for the first action (using all features described above, including firstActionSpeed), and

• at the median time for the first answer (using all features described above, including firstAnswerSpeed).

In order to better frame the problem, we build individual models for each site as well as models that consider all sites together. This approach lets us investigate the common traits among the sites, as well as build models with higher predictive power by possibly exploiting site-specific characteristics. All these models are built as logistic regressions using a variation of the ridge estimators approach (le Cessie and van Houwelingen, 1992) available in Weka (Witten, Frank, and Hall, 2011), with the ridge parameter set at 1.0E-8.
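As an illustration of this training setup, the sketch below balances the two classes, trains a logistic regression for one site at one of the three time points, and repeats the procedure over 10 random samples. It is a minimal sketch that uses scikit-learn as a stand-in for the Weka ridge-based logistic regression actually used in the study; the data-frame layout, column names, and train/test split are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

TIME_BASED = ["firstActionSpeed", "firstAnswerSpeed"]

def balanced_sample(df: pd.DataFrame, seed: int) -> pd.DataFrame:
    """Sample an equal number of Dark Side and Bright Side questions."""
    dark = df[df["dark_side"] == 1]
    bright = df[df["dark_side"] == 0].sample(len(dark), random_state=seed)
    return pd.concat([dark, bright])

def run_models(df: pd.DataFrame, drop_features=TIME_BASED, n_runs=10):
    """Train/evaluate the 'at submission time' model over n_runs balanced samples."""
    scores = []
    for seed in range(n_runs):
        sample = balanced_sample(df, seed)
        X = sample.drop(columns=["dark_side"] + list(drop_features))
        y = sample["dark_side"]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        pred = clf.predict(X_te)                      # default 0.5 threshold
        prob = clf.predict_proba(X_te)[:, 1]
        scores.append((precision_score(y_te, pred),
                       recall_score(y_te, pred),
                       roc_auc_score(y_te, prob)))
    return np.mean(scores, axis=0)                    # average precision, recall, AUC
```

Dropping no columns beyond the label (drop_features=[]) would correspond to the third model, which includes both time-based features.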
Figure 4. Precision (top), Recall (middle), and Area Under the Curve or AUC (bottom) for the three classifiers. The horizontal axis is truncated at 0.5; error bars indicate 95% CIs.

The charts in Figure 4 summarize the figures of merit (precision and recall) for the three sets of models: at submission time, at the median time for the first action, and at the median time for the first answer. These correspond to the three bars in each chart; the values are
averages from 10 runs, and confidence intervals are also displayed. A default probability threshold of 0.5 was selected to perform the binary classification in all cases. Figure 5 plots the ROC curve and the cost function (the cost of false positives and false negatives) as a function of the selected probability threshold, for the Stack Overflow site in particular. We can see that, although our default threshold choice is not optimal, it does fall in the green zone of good classification performance at affordable cost. Choosing the optimal threshold for each community would yield different values, making comparisons more difficult. Thus, we decided to use the default threshold in all cases, after double-checking that the performance of these classifiers was not compromised by this choice.
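To make the threshold discussion concrete, the sketch below (again illustrative rather than the original analysis pipeline) computes precision, recall, and AUC at a given probability threshold and sweeps the threshold to obtain a misclassification-cost curve of the kind plotted in Figure 5; the unit costs for false positives and false negatives are hypothetical placeholders.

# Illustrative sketch: figures of merit at a fixed threshold and the
# cost-versus-threshold curve; cost weights are hypothetical placeholders.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def figures_of_merit(model, X, y, threshold=0.5):
    p_dark = model.predict_proba(X)[:, 1]      # estimated probability of the Dark Side
    y_hat = (p_dark >= threshold).astype(int)
    return (precision_score(y, y_hat),
            recall_score(y, y_hat),
            roc_auc_score(y, p_dark))

def cost_curve(model, X, y, c_fp=1.0, c_fn=1.0):
    # Total cost of false positives and false negatives as the threshold varies.
    p_dark = model.predict_proba(X)[:, 1]
    y = np.asarray(y)
    thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        y_hat = (p_dark >= t).astype(int)
        fp = int(np.sum((y_hat == 1) & (y == 0)))
        fn = int(np.sum((y_hat == 0) & (y == 1)))
        costs.append(c_fp * fp + c_fn * fn)
    return thresholds, np.array(costs)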
Figure 5. Plots of the ROC curve (left) and of the cost of our binary classifier, in terms of false positives and false negatives, as a function of the selected threshold (right), for the Stack Overflow site. The dashed lines correspond to the default threshold of 0.5 used in this analysis.

Prediction at submission time. This first set of models evaluates the probability of a question being doomed to the Dark Side at submission time. It uses all features described
above except firstAnswerSpeed and firstActionSpeed. The predictive power is quite limited, as seen from the precision and recall for all sites, which range from .58 to .65. Stack Overflow and Apple perform slightly better than the others.

Prediction at median time for the first action. This set of models evaluates the probability by, at most, the median time of the first action (either a comment or the first answer), which is needed to compute the firstActionSpeed value. This means that our prediction happens later than submission time, but the predictive power is higher, with precision and recall ranging from .64 to .70 (Stack Overflow performs slightly better than the other sites).

Prediction at median time for the first answer. This set of models uses the whole set of features, including the value of firstAnswerSpeed, which requires waiting at most until the median time for the first answer (around 17 minutes on average). The precision and recall reach higher levels, getting close to or above .7 for all models (although Wordpress scores slightly worse than the others).

Across the three successive sets of models, we observe a clear benefit to prediction accuracy from including the early time-based predictors. The performance of these models does not appear to be correlated with the size of the community. Below, we first discuss the contribution of the individual features to the overall performance, and then we evaluate the models. We also ran the same analysis using a Bayes Net model and obtained a similar classification accuracy (.66-.72 using the same features as the third model), which supports the reliability of the results presented above.
5.1.3 Comparing the features
To compare the impact of the features, we discuss the average odds ratios from the logistic regression models at First Answer time. The odds ratios estimate the change in the odds of an instance (in this case, a question) being classified in the target class (in our case, the Dark Side) for a one-unit increase of each feature.

Table 3. Odds ratios from the logistic regression model at First Answer time for the features (rows) on each site (columns). The background color indicates in light red the features that predict higher odds for the Dark Side, in light blue those that predict lower odds for the Dark Side, and in white those that are milder predictors or not consistent across the sites. The contribution to the prediction can be interpreted from the values, as their difference from 1.0. The 95% confidence intervals were smaller than 0.1.

Feature            Apple  Ask Ub.  Math   Serv. Fault  SharePoint  Stack Ov.  Super User  Wordpr.  All sites
firstAnswerSpeed   0.29   0.43     0.12   0.42         0.37        0.38       0.39        0.37     0.37
prevQuestions      0.76   0.61     0.87   0.69         0.81        0.60       0.67        0.81     0.69
minPeople          0.79   0.92     0.74   0.77         0.83        0.63       0.80        0.83     0.78
sumPeople          0.92   0.32     0.72   0.95         1.04        0.91       0.92        1.04     0.87
zScore             0.89   0.86     0.92   0.88         0.93        0.80       0.89        0.93     0.87
ratioExample       0.96   0.98     1.03   1.04         0.94        1.09       1.02        0.94     0.99
nTags              1.03   1.05     1.23   0.96         0.96        0.94       0.95        0.96     0.98
titleLength        1.05   1.03     0.96   1.01         1.08        1.03       1.01        1.08     1.01
firstActionSpeed   1.13   1.10     1.00   1.10         1.07        1.00       1.14        1.07     1.09
weekend            0.84   0.92     1.05   1.19         1.28        1.17       0.95        1.28     1.04
maxMedian          1.25   0.99     1.08   1.10         1.11        1.02       1.28        1.11     1.08
maxPeople          0.84   2.85     1.00   0.88         0.88        1.07       0.81        0.88     0.99
hasExample         1.00   1.35     1.23   1.11         1.17        1.64       1.25        1.17     1.22
bodyLength         1.27   1.16     1.19   1.12         1.17        1.12       1.17        1.17     1.17
minMedian          1.23   1.24     1.27   1.25         1.17        1.54       1.19        1.17     1.25
Table 3 shows the odds ratios for the features on each site. The values presented are averages from running the logistic regression over ten independent sets (the 95% confidence intervals for these values were consistently smaller than 0.1).
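For readers who wish to reproduce this kind of table, the odds ratio of each feature can be obtained by exponentiating its logistic-regression coefficient. The short sketch below is hypothetical and builds on the training sketch given earlier (the models list and feature names are assumptions, not the original Weka output).

# Illustrative sketch: odds ratios as the exponential of the logistic-regression
# coefficients, averaged over the ten balanced training runs.
import numpy as np
import pandas as pd

def average_odds_ratios(models, feature_names):
    coefs = np.array([m.coef_[0] for m in models])   # one coefficient vector per run
    return pd.Series(np.exp(coefs).mean(axis=0), index=feature_names)

# Values below 1 indicate lower odds of the Dark Side for a one-unit increase
# of the (normalized) feature; values above 1 indicate higher odds.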
Six features predicted the Dark Side more strongly and consistently across the eight sites. They are the following, in order of impact:
• Speed of the first answer by the community to the question (firstAnswerSpeed);
• Number of previous questions posed by the questioner (prevQuestions);
• Minimum median response time, i.e., the responsiveness of the fastest tag-based community (minMedian);
• Presence of an example (or code) in the question (hasExample);
• Number of answerers in the smallest tag-based community, among the tags (minPeople);
• Length of the body of the question (bodyLength).
Thus, in Stack Exchange, a question is more likely to be doomed to the Dark Side when the speed of the first answer is slower, the questioner has less experience asking questions, the median response time of the fastest tag-based community of answerers is higher, there are fewer people in the smallest tag-based community, the question contains an example (or code), and the text of the question is longer. The zScore is a milder predictor, but it is still consistent across the sites: the higher the normalized ratio between answers and questions delivered by a user in his or her prior activity on the site, the less likely a new question posed by that participant is to fall into the Dark Side. This suggests that greater experience in answering questions by the questioner
allows him or her to formulate new questions better and thus improves the chances that such questions will be resolved in a timely manner.

The other four features were weaker predictors and were not consistent across the sites. Asking during the weekend (as opposed to a weekday) can penalize the asker on sites such as SharePoint and WordPress but can be advantageous on other sites such as Apple. As mentioned above, a lower speed of the fastest tag-based community consistently predicts the Dark Side (minMedian). In contrast, a lower speed of the slowest tag-based community (maxMedian) predicts the Dark Side in Super User and Apple but not in Ask Ubuntu and Stack Overflow. In addition, the sum of the sizes of all the tag-based communities (sumPeople) is a predictor for three sites (the more people, the better) but is not influential on the other sites. Similarly, the size of the largest tag-based community (maxPeople) is not a consistent predictor and exhibits an unusual inversion of its impact for the Ask Ubuntu site, which calls for future qualitative research. Unlike the speed of the first answer, which is a strong predictor across sites, the speed of the first action (either a comment, an edit, or an answer) is either a mild predictor with the opposite impact (a faster first action predicts the Dark Side) or not influential. The observed contrast between the two early time-based predictors, and the site-specific profiles with respect to the various features (for example, the large effect of maxPeople in Ask Ubuntu in Table 3), call for further qualitative and quantitative research to be better explained.
5.2 Testing the logistic regression model
5.2.1 Quantitative analysis of prediction accuracy
The models above have also been tested on datasets that include all questions posted in the two months following the period used for training. For the sake of clarity, we discuss here only the case of predicting at the median time for the first answer for Stack Overflow; the same type of analysis performed on the other sites and with the alternative models led to similar results. The testing dataset for Stack Overflow had 295,710 questions, with approximately 35% in the Dark Side and 65% in the Bright Side. None of the questions included in the test dataset were used in the training phase. The two classes, as expected, are quite unbalanced; hence, we report precision and recall. It is also informative to assess the performance by looking at the true/false positives and true/false negatives (see Table 4). The natural imbalance of the two classes determined, as expected, an increase in the number of false positives (questions erroneously classified as in the Dark Side) with respect to the cross-fold evaluations on the balanced datasets, with a consequent reduction of the precision of the estimation. Yet, the robustness with respect to false negatives (questions erroneously classified as in the Bright Side) produced an increase in the recall measure.

Table 4. Assessment of the median answer time model for Stack Overflow on the unbalanced test dataset. Low and high confidence refer to predictions with a probability much higher (>.7) or much lower than the .5 classification threshold.
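As an illustration of this held-out evaluation (again a sketch, not the original Weka run), the confusion-matrix counts and the precision and recall for the Dark Side class on the unbalanced test set could be computed as follows; model, X_test, and y_test are hypothetical objects following the earlier sketches.

# Illustrative sketch: evaluation on the held-out, unbalanced test set (cf. Table 4).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

def test_report(model, X_test, y_test, threshold=0.5):
    p_dark = model.predict_proba(X_test)[:, 1]
    y_hat = (p_dark >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
    return {"true_positives": tp, "false_positives": fp,
            "true_negatives": tn, "false_negatives": fn,
            "precision": precision_score(y_test, y_hat),
            "recall": recall_score(y_test, y_hat)}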