Evaluating Answer Quality across Knowledge Domains: Using Textual and Non-textual Features in Social Q&A

Hengyi Fu, School of Information, Florida State University, [email protected]
Shuheng Wu, Graduate School of Library and Information Studies, Queens College, City University of New York, [email protected]
Sanghee Oh, School of Information, Florida State University, [email protected]

ASIST 2015, November 6-10, 2015, St. Louis, MO, USA.
© 2015 Hengyi Fu, Shuheng Wu, and Sanghee Oh.
ABSTRACT
As an increasingly important source of information and knowledge, social questioning and answering sites (social Q&A) have attracted significant attention from industry and academia, both of which face the challenge of evaluating and predicting the quality of answers on such sites. However, few previous studies have examined answer quality with knowledge domains or topics as a potential factor. To fill this gap, a model consisting of 24 textual and non-textual features of answers was developed in this study to evaluate and predict answer quality in social Q&A, and the model was applied to identify and compare useful features for predicting high-quality answers across four knowledge domains: science, technology, art, and recreation. The findings indicate that review and user features are the most powerful indicators of high-quality answers regardless of knowledge domain, while the usefulness of textual features (length, structure, and writing style) varies across knowledge domains. In the future, the findings could be applied to automatic assessment of answer quality and quality control in social Q&A.
Keywords
Social question and answer sites, answer quality, quality assessment.
INTRODUCTION
Social questioning and answering sites (social Q&A) have become an increasingly important online platform for knowledge building and exchange. Social Q&A allows individuals to post questions, provide answers and comments, and evaluate questions and answers posted by their peers. The popularity of social Q&A has led to a growing body of literature on assessing and predicting answer quality. Most of the prior work has approached this issue from two perspectives: users and data. The first group investigated what criteria users applied to evaluate answer
quality (Kim & Oh, 2009; Shah & Pomerantz, 2010; Zhu, Bernhard, & Gurevych, 2009). Zhu et al. (2009) developed a multi-dimensional model for assessing answer quality from user perspectives, including informativeness, completeness, readability, conciseness, truthfulness, level of detail, originality, objectivity, novelty, usefulness, and expertise. This model was later used to evaluate the answer quality of Answerbag and Yahoo! Answers based on human judgement (Shah & Pomerantz, 2010; Zhu et al., 2009). The second group focused on extracting data features from questions and/or answers to automatically predict answer quality (Dalip, Gonçalves, Cristo, & Calado, 2013; Shah & Pomerantz, 2010), but the selection of features varies across studies and depends on the researchers' perceptions of quality and the availability of different features. Despite ongoing concerns about answer quality in social Q&A, little research has examined knowledge domain (topic) as a potential factor in assessing and predicting answer quality from both user and data perspectives. The criteria that a user applies to evaluate answers to science-related questions may differ from those applied to art-related questions. The usefulness of a particular quality criterion in predicting answer quality may also vary across knowledge domains. For example, the number of reference resources an answer includes may be highly correlated with the answer quality of science-related questions, but may not be very useful in predicting the answer quality of art-related questions. Harper, Raban, Rafaeli, and Konstan (2008) concluded that topics had a small and marginally significant effect on answer quality, but did not go further to study how they could influence quality assessment. Therefore, the purpose of this study is to fill this gap by (a) developing a model to evaluate and predict answer quality for social Q&A and (b) applying this model to identify and compare useful features of high-quality answers across different knowledge domains.
RELATED WORK
Answer quality in social Q&A can be captured either directly through textual features, such as content length or structure, or indirectly through non-textual features, such as review histories or author reputation. Textual features can be further divided into four groups: length, structure, style, and readability of texts. Non-textual features can be data
related to users, reviews, and networks in social Q&A (Dalip et al., 2013).
Textual features
Length features include the count of characters, words, or sentences, which gives hints about whether the text is comprehensive. The intuition behind them is that a mature text of good quality should be neither too short, which may indicate incomplete coverage, nor excessively long, which may indicate verbose content. Length features have been proven to be one of the most powerful indicators of quality in both Wikipedia (Blumenstock, 2008; Hu et al., 2007) and social Q&A (Agichtein et al., 2008; Harper et al., 2008).
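For illustration only (the exact preprocessing is not described in this poster), length features such as word and sentence counts could be computed with a simple tokenizer like the Python sketch below; the splitting rules here are assumptions, not the study's actual implementation.

```python
import re

def length_features(text: str) -> dict:
    """Compute simple length features (word and sentence counts) for an answer."""
    # Split on sentence-ending punctuation followed by whitespace; a rough
    # heuristic, not the exact tokenizer used in the study.
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z0-9'-]+", text)
    return {"l_wc": len(words), "l_sc": len(sentences)}

print(length_features("Use a tripod. Longer exposures need one, or the image will blur."))
# -> {'l_wc': 12, 'l_sc': 2}
```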
Structure features are indicators of how well the content of texts is organized. Previous studies on the quality of Wikipedia articles highlighted the significance of features that reflect organizational aspects of the article, such as the number of sections, images, external links, and citations (Hasan Dalip, André Gonçalves, Cristo, & Calado, 2009). Although the content structure of answers in social Q&A is not exactly the same as that of Wikipedia articles, some structure features can still be used to predict answer quality. For example, the presence of references and external resources is known to be an influential feature of high-quality answers in social Q&A (Gazan, 2006). Style features are intended to capture the author's writing style. The intuition behind them is that high-quality content should present some distinguishable characteristics related to word usage, such as the use of auxiliary verbs, pronouns, conjunctions, prepositions, and short sentences. Style features have been proven effective in predicting the quality of user-generated content in Wikipedia articles (Lipka & Stein, 2010) and answers in social Q&A (Dalip et al., 2013; Hoang, Lee, Song, & Rim, 2008), but the selection of style features varies among studies. Readability features are intended to estimate the minimal age group necessary to comprehend a text. The intuition behind them is that high-quality content should be well written, understandable, and free from unnecessary complexity. Several measures of readability have been proposed, including the Gunning-Fog Index, the Flesch-Kincaid Formula, and SMOG Grading. These measures combine the number of syllables, words, and sentences in the text, and were used to predict content quality in Wikipedia (Rassbach, Pincock, & Mingus, 2007) and social Q&A (Dalip et al., 2013), but resulted in low effectiveness in both settings. Another set of textual features proposed specifically for online question answering is relevance features, which capture the similarity between an answer and its associated question, such as the number of words or sentences shared by the question and answer. Dalip et al. (2013) tested relevance features using question-answer pairs from Stack Overflow, and concluded that these features were not very indicative of answer quality.
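For reference, the readability indexes named above follow standard published formulas. The sketch below is not part of the reviewed studies; it computes the Flesch-Kincaid grade level and the Gunning Fog index, with a naive vowel-group syllable counter standing in for a proper one.

```python
import re

def count_syllables(word: str) -> int:
    # Naive vowel-group heuristic; a stand-in for a real syllable counter.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wc = max(1, len(words))
    return {
        # Flesch-Kincaid grade level
        "fk_grade": 0.39 * wc / sentences + 11.8 * syllables / wc - 15.59,
        # Gunning Fog index
        "gunning_fog": 0.4 * (wc / sentences + 100.0 * complex_words / wc),
    }
```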
Non-textual features
User features reflect users' activities and expertise levels in a social Q&A site. The intuition behind user features is that content quality can be inferred by examining the author who provides it. User features, such as account history (e.g., member since), community role, achievement, and reputation, have been used and proven to be significant quality predictors in different knowledge creation communities (Shah & Pomerantz, 2010; Stvilia, Twidale, Smith, & Gasser, 2005; Tausczik & Pennebaker, 2011). In social Q&A, user features extracted from answerers' profiles and the points that they earned were the most significant features for predicting the best quality answers (Shah & Pomerantz, 2010).
Review features are those extracted from the review history or comments on certain content. These features are useful for estimating the maturity and stability of content; it can be expected that content receiving many edits or comments has likely been improved over time. In general, the more the content has been reviewed, the better its quality. Positive correlations exist between content quality and review features such as the number of edits/revisions, discussions, and comments (Lih, 2004; Stvilia et al., 2005; Wilkinson & Huberman, 2007). Network features are those extracted from the connectivity network inherent within a community. The motivation for using these features to evaluate content quality is that citations between content items can provide evidence of their importance. For example, although a high-quality Wikipedia article should be stable, such stability may instead result from poor quality that leaves no one interested in editing the article. It is therefore important to take into account the popularity of content during quality evaluation, which can be estimated by measures taken from its connectivity network. Network features, such as PageRank, link count, and translation count, have been tested in Dalip et al.'s (2011) study on Wikipedia article quality, but did not show high effectiveness.
RESEARCH QUESTIONS
This poster addresses the following research questions:
RQ1: What textual or non-textual features of social Q&A can be useful for assessing the quality of answers?
RQ2: How do the features of high-quality answers in social Q&A vary across different knowledge domains, especially science, technology, art, and recreation?
DATA COLLECTION AND RESEARCH METHOD
To answer the first research question, this study selected a set of 24 quality features that have been reported in the literature as useful for predicting the quality of content in Wikipedia and social Q&A. Table 1 shows the five feature groups of the 24 features. The usefulness of these quality features was tested and reported in this poster.
Feature group: Features
Length: counts of words (l_wc) and sentences (l_sc)
Structure: counts of paragraphs (s_pc), images (s_ic), external links (s_elc), and internal links to other questions/answers on the site (s_ilc)
Style: counts of auxiliary verbs (t_avc), "to be" verbs (t_bvc), pronouns (t_pc), conjunctions (t_cc), interrogative pronouns (t_ipc), subordinating conjunctions (t_scc), prepositions (t_prc), nominalizations (t_nc), and passive voice sentences (t_pvc)
User: counts of user points (u_point), merit badges (u_mbc), questions posted by a user (u_qc), answers posted by a user (u_ac), comments posted by a user (u_cc), and suggested edits made by a user (u_ec)
Review: counts of revisions (r_rc) and comments an answer received (r_cc), and answer age in days (r_aage)
Table 1. Quality features of five different groups
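As an illustration of how the structure features in Table 1 might be derived from an answer's HTML body (the study used custom C++ code whose details are not reported here), the Python sketch below counts paragraphs, images, and internal versus external links; the site_domain parameter and the counting rules are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class StructureFeatureParser(HTMLParser):
    """Count paragraphs, images, and internal/external links in an answer body."""
    def __init__(self, site_domain: str):
        super().__init__()
        self.site_domain = site_domain
        self.counts = {"s_pc": 0, "s_ic": 0, "s_elc": 0, "s_ilc": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.counts["s_pc"] += 1
        elif tag == "img":
            self.counts["s_ic"] += 1
        elif tag == "a":
            host = urlparse(dict(attrs).get("href", "")).netloc
            # Relative links (empty host) are treated as internal here;
            # a real pipeline would refine this rule.
            key = "s_ilc" if (not host or self.site_domain in host) else "s_elc"
            self.counts[key] += 1

def structure_features(body_html: str, site_domain: str = "photo.stackexchange.com"):
    parser = StructureFeatureParser(site_domain)
    parser.feed(body_html)
    return parser.counts
```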
To answer the second research question, this study collected data from StackExchange (http://stackexchange.com/sites?view=list#traffic), a group of popular social Q&A sites currently comprising 133 topical Stack Exchange websites hosting almost 3.8 million users. As of May 2015, 3.1 million questions had been posted, eliciting 4.5 million answers. The 133 social Q&A sites are grouped into six topic areas: science, technology, business, professional, culture/recreation, and life/arts. This study selected four areas to represent different knowledge domains (science, technology, art, and recreation) and used one site from each area, chosen for their similarities in site age, membership, and activity levels, namely Mathematics, Ask Ubuntu (Q&A for Ubuntu developers), Photography, and Arqade (Q&A for passionate video gamers), to compare the features associated with high-quality answers across these sites. High-quality answers were identified using StackExchange's Answer Score, which is calculated as the number of upvotes minus the number of downvotes an answer received. With the help of the StackExchange API (https://api.stackexchange.com/docs/), the top 1,000 answers with the highest scores were extracted from each of the four sites to represent high-quality answers.
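A minimal sketch of this extraction step is given below, assuming the public /2.2/answers endpoint of the StackExchange API with vote-based sorting; API keys, throttling, and error handling are omitted, and the parameter values are illustrative rather than taken from the authors' scripts.

```python
import requests

API = "https://api.stackexchange.com/2.2/answers"

def top_answers(site: str, total: int = 1000, pagesize: int = 100):
    """Fetch the highest-scored answers from one StackExchange site."""
    answers, page = [], 1
    while len(answers) < total:
        resp = requests.get(API, params={
            "site": site,          # e.g. "math", "askubuntu", "photo", "gaming"
            "sort": "votes",       # order answers by score (upvotes - downvotes)
            "order": "desc",
            "page": page,
            "pagesize": pagesize,
            "filter": "withbody",  # include the answer body for textual features
        })
        resp.raise_for_status()
        data = resp.json()
        answers.extend(data["items"])
        if not data.get("has_more"):
            break
        page += 1
    return answers[:total]
```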
One of the researchers developed C++ code to collect data related to the structure features of answer content. Length and style features of answers were calculated with the GNU Style and Diction programs (http://www.gnu.org/software/diction/). Data related to the user and review features associated with the 1,000 answers were extracted directly from the four sites using the StackExchange API. Exploratory factor analysis (EFA) was conducted in SPSS to identify and compare the features in the five groups (factors) across the four sites. EFA is a method aimed at extracting maximum variance from the dataset within each factor. For each dataset, all 24 features were tested using Maximum Likelihood as the extraction method with varimax rotation; factors with eigenvalues above 1 were retained. A cut-off of 0.6 was selected for analyzing factor loadings, since loadings above this value are generally considered high (Field, 2009).
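The analysis itself was run in SPSS; a rough Python equivalent of the same settings (maximum likelihood extraction, varimax rotation, retention of factors with eigenvalues above 1, and a 0.6 loading cut-off), assuming the third-party factor_analyzer package, might look like this:

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

def run_efa(features: pd.DataFrame, loading_cutoff: float = 0.6):
    """features: one row per answer, one column per quality feature (24 columns)."""
    # First pass: how many factors have eigenvalues above 1?
    probe = FactorAnalyzer(rotation=None)
    probe.fit(features)
    eigenvalues, _ = probe.get_eigenvalues()
    n_factors = int((eigenvalues > 1).sum())

    # Second pass: maximum likelihood extraction with varimax rotation.
    fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax", method="ml")
    fa.fit(features)
    loadings = pd.DataFrame(fa.loadings_,
                            index=features.columns,
                            columns=[f"Factor{i + 1}" for i in range(n_factors)])
    # Keep only loadings at or above the cut-off, as in Table 2.
    return loadings.where(loadings.abs() >= loading_cutoff)
```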
FINDINGS AND DISCUSSION
The results of the EFA for each site are given in Table 2. Factor 1 for Mathematics contains two structure features: image count and external link count. Factor 2 is dominated by review features, including revision count and comment count; the count of comments the answerer made is significantly related to this factor as well. All features in Factor 3 are related to the user/author, including the counts of merit badges, answers, and suggested edits. Factor 4 is a combination of length (sentence count), structure (internal link count), and style (passive voice sentence count) features; this factor can be interpreted as content or text related. For Ask Ubuntu, Factor 1 is mainly about the user/author, including points and the counts of merit badges, comments, and suggested edits. Review features play an important role in Factor 2. Similar to Mathematics, Factor 3 has a mix of length (sentence count), structure (external link count and internal link count), and style (preposition count) features. Factor 1 for Photography contains three features: image count, revision count, and comment count. Factor 2 is mainly related to user features (merit badge count, answer count, the count of comments a user made, and answer age). Factor 3 includes two length features (counts of words and sentences) and one structure feature (external link count). Factor 4 is dominated by style features. For Arqade, Factor 1 has three features from the review (revision count and comment count) and user (user points) groups. Factor 2 contains only one length feature (word count). Factor 3 contains three features from the user group (merit badge count, answer count, and the count of comments a user made). Factor 4 is dominated by structure features (paragraph count, image count, and external link count). Factor 5 is mainly related to style features (pronoun count and subordinating conjunction count).
Mathematics (Science)
  Factor 1: s_ic (.816), s_elc (.960)
  Factor 2: u_cc (.627), r_rc (.904), r_cc (.967)
  Factor 3: u_mbc (.997), u_ac (.747), u_ec (.926)
  Factor 4: l_sc (.815), s_ilc (.741), t_pvc (.604)
Ask Ubuntu (Technology)
  Factor 1: u_point (.873), u_mbc (.868), u_cc (.652), u_ec (.742), r_aage (.606)
  Factor 2: l_wc (.634), r_rc (.858), r_cc (.914)
  Factor 3: l_sc (.763), s_ilc (.714), s_elc (.623), t_prc (.667)
Photography (Art)
  Factor 1: s_ic (.704), r_rc (.823), r_cc (.876)
  Factor 2: u_mbc (.872), u_ac (.695), u_cc (.740), r_aage (.800)
  Factor 3: l_wc (.918), l_sc (.755), s_elc (.793)
  Factor 4: t_scc (.721), t_pc (.738), t_prc (.817)
Arqade (Recreation)
  Factor 1: u_point (.810), r_rc (.726), r_cc (.872)
  Factor 2: l_wc (.804)
  Factor 3: u_mbc (.903), u_ac (.765), u_cc (.737)
  Factor 4: s_pc (.689), s_ic (.842), s_elc (.709), t_prc (.641)
  Factor 5: t_pc (.870), t_scc (.616), t_nc (.722)
Table 2. A summary of quality feature groupings (factors) in the four social Q&A sites
The results show that the review and user features, both of which are non-textual, are the most useful groups. Review features, especially revision count and comment count, are indicators of high-quality answers across the four sites. These features, which measure user engagement with an answer, could be the most effective indicators of answer quality. User features, such as the counts of merit badges, answers, and comments a user made, are also useful for identifying high-quality answers regardless of the knowledge domains of the sites. This highlights the importance of user profiles and a user's history in assessing answer quality. The findings also indicate that the importance of textual features (length, structure, and writing style) varies across knowledge domains. For example, structure features such as image count and external link count are the most essential features of high-quality answers in the Mathematics Q&A, indicating users' recognition of equations (usually presented in image format) and references in science-related answers. Similarly, image count is a powerful indicator of answer quality in the Photography Q&A, suggesting that attaching real photos is essential when answering these kinds of questions. Length features play an important role in assessing answer quality in Photography and Arqade. This implies that for social Q&A addressing more "subjective" topics (e.g., art, recreation), users tend to perceive answer length as positively related to answer quality. Style features, such as pronoun count, subordinating conjunction count, preposition count, nominalization count, and passive voice sentence count, are useful for identifying high-quality answers in the art and recreation social Q&A sites, but are not as important for those addressing science and technology questions. While review and user features are the most important feature groups across knowledge domains, textual features are also useful and, more importantly, less difficult to generate in terms of the preprocessing required.
IMPLICATIONS AND CONCLUSION
This study develops a model consisting of 24 quality features for assessing answer quality in social Q&A, and then uses the model to identify and compare useful features of high-quality answers in four knowledge domains. The study has a few limitations: StackExchange is only one family of social Q&A sites and may not represent other social Q&A sites, and the four selected sites may not represent all other sites in their respective knowledge domains. The proposed model for predicting answer quality with textual and non-textual features can be applied to automatic assessment of answer quality and quality control in social Q&A. The model provides a comprehensive yet compact set of quality features, and it can be used to build classifiers to select high-quality answers across knowledge domains. Future research can validate or expand this model by testing it on social Q&A sites from other knowledge domains. This study also reveals that high-quality answers in different knowledge domains are associated with different sets of quality features. Social Q&A designers can customize the reward and promotion systems of different knowledge domains to encourage contributions that contain the most useful quality features for a particular domain. For example, since review features are strongly associated with quality across knowledge domains, more points could be awarded to those who submit comments or suggest edits to answers provided by others. For science or technology sites, more points and merit badges can be granted to those who provide equations, source code, and references in their answers.
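As one possible instantiation of such a classifier (not part of this study), the 24 features could feed a standard supervised model; the scikit-learn sketch below assumes a labeled dataset in which top-scored answers are marked 1 and all others 0.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def quality_classifier(X, y):
    """X: answers x 24 quality features; y: 1 for high-quality answers, else 0."""
    # Scale the features so that count-based features on very different
    # ranges (e.g., word counts vs. user points) are comparable.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    model.fit(X, y)
    return model, scores.mean()
```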
REFERENCES
Agichtein, E., Castillo, C., Donato, D., Gionis, A., & Mishne, G. (2008). Finding high-quality content in social media. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 183-194). ACM.
Blumenstock, J. E. (2008). Size matters: Word count as a measure of quality on Wikipedia. In Proceedings of the 17th International Conference on World Wide Web (pp. 1095-1096). ACM.
Dalip, D. H., Gonçalves, M. A., Cristo, M., & Calado, P. (2011). Automatic assessment of document quality in web collaborative digital libraries. Journal of Data and Information Quality, 2(3), 14.
Dalip, D. H., Gonçalves, M. A., Cristo, M., & Calado, P. (2013). Exploiting user feedback to learn to rank answers in Q&A forums: A case study with Stack Overflow. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 543-552). ACM.
Field, A. (2009). Discovering statistics using SPSS. Sage.
Gazan, R. (2006). Specialists and synthesists in a question answering community. Proceedings of the American Society for Information Science and Technology, 43(1), 1-10.
Harper, F. M., Raban, D., Rafaeli, S., & Konstan, J. A. (2008). Predictors of answer quality in online Q&A sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 865-874). ACM.
Hasan Dalip, D., André Gonçalves, M., Cristo, M., & Calado, P. (2009). Automatic quality assessment of content created collaboratively by web communities: A case study of Wikipedia. In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 295-304). ACM.
Hoang, L., Lee, J. T., Song, Y. I., & Rim, H. C. (2008). A model for evaluating the quality of user-created documents. In Information Retrieval Technology (pp. 496-501). Springer Berlin Heidelberg.
Hu, M., Lim, E. P., Sun, A., Lauw, H. W., & Vuong, B. Q. (2007). Measuring article quality in Wikipedia: Models and evaluation. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (pp. 243-252). ACM.
Kim, S., & Oh, S. (2009). Users' relevance criteria for evaluating answers in a social Q&A site. Journal of the American Society for Information Science and Technology, 60(4), 716-727.
Lih, A. (2004). Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, USA.
Lipka, N., & Stein, B. (2010). Identifying featured articles in Wikipedia: Writing style matters. In Proceedings of the 19th International Conference on World Wide Web (pp. 1147-1148). ACM.
Rassbach, L., Pincock, T., & Mingus, B. (2007). Exploring the feasibility of automatically rating online article quality. In Proceedings of the 2007 International Wikimedia Conference, Taipei, Taiwan (p. 66).
Shah, C., & Pomerantz, J. (2010). Evaluating and predicting answer quality in community Q&A. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 411-418). ACM.
Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2005). Assessing information quality of a community-based encyclopedia. In Proceedings of the International Conference on Information Quality (pp. 442-454). Cambridge, MA: MITIQ.
Tausczik, Y. R., & Pennebaker, J. W. (2011). Predicting the perceived quality of online mathematics contributions from users' reputations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1885-1888). ACM.
Wilkinson, D. M., & Huberman, B. A. (2007). Cooperation and quality in Wikipedia. In Proceedings of the 2007 International Symposium on Wikis (pp. 157-164). ACM.
Zhu, Z., Bernhard, D., & Gurevych, I. (2009). A multi-dimensional model for assessing the quality of answers in social Q&A sites. Technical Report TUD-CS-2009-0158. Technische Universität Darmstadt.