Using Field Experimentation to Understand Information Quality in User-generated Content

Roman Lukyanenko
Florida International University
[email protected]

Jeffrey Parsons
Memorial University of Newfoundland
[email protected]
Introduction

The rise and increasing ubiquity of online interactive technologies such as social media and crowdsourcing (Barbier et al. 2012; de Boer et al. 2012; Doan et al. 2011; Whitla 2009) create a fertile environment for field experimentation, affording researchers the opportunity to develop, test, and deploy innovative design solutions in a live setting. In this research, we use a real crowdsourcing project as an experimental setting to evaluate innovative approaches to conceptual modeling and to improve the quality of user-generated content (UGC).

Organizations are increasingly looking to harness UGC to better understand customers, develop new products, and improve the quality of services such as healthcare or municipal services (Barwise and Meehan 2010; Culnan et al. 2010; Whitla 2009). Scientists and monitoring agencies sponsor online UGC systems - citizen science information systems - that allow ordinary users to provide observations of local wildlife, report on weather conditions, track earthquakes and wildfires, or map their neighborhoods (Flanagin and Metzger 2008; Haklay 2010; Hand 2010; Lukyanenko et al. 2011).

Despite the growing reliance on UGC, a pervasive concern is the quality of data produced by ordinary people. Online users are typically volunteers, resulting in a user base with diverse motivations and variable domain knowledge (Arazy et al. 2011; Coleman et al. 2009). When dealing with casual contributors external to the organization, traditional approaches to information quality (IQ) management break down (Lukyanenko and Parsons 2011; Parsons and Lukyanenko 2011). Traditionally, information production processes are assumed to be designed to support the needs of data consumers - typically employees or others associated with the sponsoring organization who require information for decision-making and other tasks (Lee and Strong 2003; Redman 1996). Consequently, data production typically conforms to the way the data is to be used.
For example, because the biological species is a major unit of scientific analysis, the prevailing practice in online citizen science (e.g., see www.eBird.org) is to require online volunteers to classify observed phenomena (e.g., birds) at the species level of specificity
(Hochachka et al. 2012; Silvertown 2010). This requirement, however, may not be realistic for some casual contributors - often biology non-experts who may struggle to accurately identify observed phenomena at the species level (Parsons et al. 2011).

In this research, we advance an alternative perspective on IQ in UGC. We argue that conventional wisdom in IQ underrepresents the critical role of information contributors. In a flexible and open UGC environment with weak controls over information production, data contributors should be given more freedom to determine what and how much information to provide. Adopting a contributor-oriented perspective, however, requires rethinking fundamental approaches to information systems (IS) design. In the traditional development paradigm, IS structure reflects the intended uses of information as defined by information consumers. These views shape the design of user interfaces, application logic, and data structures defined in databases.

As the structure and behavior of IS objects are informed by information modeling, we focus on the impact of information models on IQ. Prevailing modeling grammars (e.g., Entity Relationship diagrams) organize domains in terms of predefined abstractions (e.g., classes such as biological species) (Parsons and Wand 2008; Parsons and Wand 1997). We argue that employing traditional information modeling in UGC settings may have a detrimental impact on IQ, and we propose an alternative approach to modeling that does not require contributors to classify phenomena. Following ontological and cognitive principles (see Lukyanenko and Parsons 2013), we believe contributors should be allowed to describe observed phenomena using attributes and, when confident in the classification, classes.
As contributors may hold different views about domain phenomena, each contributor should be free to provide his/her own attributes and classes rather than be constrained by the classes defined in advance (even if these reflect intended uses of UGC). We thus advocate a use-agnostic approach to IQ management (Lukyanenko et al. in press; Parsons et al. 2014) and an instance-based approach to information modeling (Lukyanenko and Parsons 2012; Parsons and Wand 2000) in UGC settings.
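The contrast between the two modeling approaches can be sketched in code. This is a hypothetical illustration only: the record structures, species names, and function names are ours, not NLNature's actual implementation.

```python
# Minimal sketch of the two data-collection approaches; all structures
# and names here are hypothetical, not NLNature's actual schema.

# Class-based capture: a contributor must pick from a predefined species list.
SPECIES_LIST = ["Larus argentatus", "Vulpes vulpes"]  # fixed in advance by data consumers

def record_class_based(species):
    if species not in SPECIES_LIST:
        raise ValueError("species not in schema; observation is lost")
    return {"species": species}

# Instance-based capture: a contributor describes the observed instance with
# whatever attributes they are confident about; classes are optional and
# contributor-supplied rather than constrained by a predefined schema.
def record_instance_based(attributes, classes=None):
    return {"attributes": list(attributes), "classes": list(classes or [])}

# A biology non-expert who cannot name the species can still contribute:
obs = record_instance_based(["small", "grey", "red patch on head"], classes=["bird"])
```

The key design difference is where validation happens: the class-based sketch rejects anything outside the consumer-defined schema, whereas the instance-based sketch accepts any description, deferring classification to later analysis.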
Field Experimentation

A use-agnostic approach to IS development raises questions about the feasibility and practicality of approaching data collection from the contributor's point of view, and about the extent to which the resulting data remains useful to data consumers. To address these issues, we designed a field experiment in which we could simultaneously demonstrate how to construct a real use-agnostic, instance-based IS and evaluate it relative to a traditional (class-based) IS. To that end, we
redesigned an existing web-based citizen science project, NLNature (www.nlnature.com), which collects sightings of plants and animals in Newfoundland and Labrador (Canada). We randomly assigned a subset of NLNature users to one of two data collection interfaces. The first interface implemented a traditional class-based approach to information modeling: users were required to classify observed organisms at the biological species level by selecting from a list of species predetermined by biology experts based on the intended uses of the data in supporting an ongoing research agenda. In contrast, the second interface implemented an instance-based approach to information modeling: users were asked to provide attributes of observed organisms and, optionally, classes to which they believed the observed organism belonged. Users were assigned to one of the two interfaces upon registration and remained in the assigned condition for the duration of the experiment (June - December 2013).

We hypothesized that instance-based data collection would lead to higher information completeness, manifested in a greater number of biological organisms reported by users of the instance-based interface. We also predicted that not requiring users to report information based on a predefined schema would lead to greater novelty (i.e., species not present in the original schema) among the species reported, demonstrating the potential of use-agnostic IS to facilitate discoveries and to produce data of immediate relevance and importance to data consumers (here, scientists).

Consistent with these predictions, users of the instance-based interface provided 390 observations of biological instances, compared with only 87 observations made in the class-based interface. Users in both conditions provided 997 attributes and classes: 87 in the class-based and 910 in the instance-based condition.
Of these, 701 attributes and classes were new - they did not exist in the initial schema and were suggested by users as additions, including 119 new species-level classes in the instance-based condition and 7 in the class-based condition (suggested via comments to an observation). The results are statistically significant and support our propositions.

Notably, several sightings of biological importance were reported during the experiment, including an unanticipated distribution of species, a mosquito alien to the geographic area of the study, and the discovery of a possibly new species of wasp (presently pending scientific verification). It is also notable that four of the top five contributors belonged to the instance-based condition, collectively producing 315 sightings - 80.8% of the observations in the instance-based condition and 66.0% of all observations collected during the study period. Although the numbers are relatively low due to the narrow geographic focus and the niche nature of the project, they clearly demonstrate that prevailing approaches to modeling and information quality may routinely prevent relevant information from being collected and stored. The results establish a novel connection between information modeling and information quality and suggest a new mechanism for increasing IQ.
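The 390 vs. 87 split of observations can be checked against chance. We do not specify here which statistical test was applied in the study; as an illustrative sketch only, an exact two-sided binomial test against an equal-split null (assuming users were allocated to the two conditions in roughly equal numbers) shows the imbalance is far beyond what chance would produce:

```python
import math

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: sums the probabilities of all
    outcomes no more likely than the observed count k."""
    def pmf(i):
        return math.comb(n, i) * (p ** i) * ((1 - p) ** (n - i))
    observed = pmf(k)
    # Small tolerance guards against floating-point ties.
    return min(1.0, sum(pmf(i) for i in range(n + 1)
                        if pmf(i) <= observed * (1 + 1e-9)))

# 390 of the 477 total observations came from the instance-based condition.
p_value = binomial_two_sided_p(390, 390 + 87)
print(p_value < 0.001)  # prints True: the split is not plausibly due to chance
```

A chi-square or two-proportion test would give a similar verdict; the binomial version is shown because it needs only the standard library.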
Conclusion

As organizations invite diverse and unpredictable UGC into internal decision making, they face the challenge of managing the quality of such datasets. Using field experimentation allowed us to demonstrate, in a real setting, that a use-agnostic approach to data collection can improve the quality of UGC. Field experimentation further enabled us to implement a nascent IS theory in the context of the established discipline of biology and to affect existing practices in that discipline in an immediate and tangible way. As more disciplines come to rely on information technologies for analysis and data production, IS researchers can leverage field experimentation to better understand, intervene in, and improve live information technology practices while exporting IS theories to other research communities.
References

Arazy, O., O. Nov, R. Patterson, and L. Yeo. "Information Quality in Wikipedia: The Effects of Group Composition and Task Conflict," Journal of Management Information Systems (27:4), 2011, pp. 71-98.

Barbier, G., R. Zafarani, H. Gao, G. Fung, and H. Liu. "Maximizing Benefits from Crowdsourced Data," Computational and Mathematical Organization Theory (18:3), 2012, pp. 257-279.

Barwise, P., and S. Meehan. "The One Thing You Must Get Right When Building a Brand," Harvard Business Review (88:12), 2010, pp. 80-84.

Coleman, D. J., Y. Georgiadou, and J. Labonte. "Volunteered Geographic Information: The Nature and Motivation of Producers," International Journal of Spatial Data Infrastructures Research (4:1), 2009, pp. 332-358.

Culnan, M. J., P. J. McHugh, and J. I. Zubillaga. "How Large U.S. Companies Can Use Twitter and Other Social Media to Gain Business Value," MIS Quarterly Executive (9:4), 2010, pp. 243.

de Boer, V., M. Hildebrand, L. Aroyo, P. De Leenheer, C. Dijkshoorn, B. Tesfa, and G. Schreiber. "Nichesourcing: Harnessing the Power of Crowds of Experts," in Knowledge Engineering and Knowledge Management, A. ten Teije, J. Völker, S. Handschuh, H. Stuckenschmidt, M. d'Acquin, A. Nikolov, N. Aussenac-Gilles, and N. Hernandez (eds.), Springer Berlin / Heidelberg, 2012, pp. 16-20.

Doan, A., R. Ramakrishnan, and A. Y. Halevy. "Crowdsourcing Systems on the World-Wide Web," Communications of the ACM (54:4), 2011, pp. 86-96.

Flanagin, A., and M. Metzger. "The Credibility of Volunteered Geographic Information," GeoJournal (72:3), 2008, pp. 137-148.

Haklay, M. "How Good is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets," Environment and Planning B: Planning and Design (37:4), 2010, pp. 682-703.

Hand, E. "People Power," Nature (466:7307), 2010, pp. 685-687.

Hochachka, W. M., D. Fink, R. A. Hutchinson, D. Sheldon, W. Wong, and S. Kelling. "Data-Intensive Science Applied to Broad-Scale Citizen Science," Trends in Ecology & Evolution (27:2), 2012, pp. 130-137.

Lee, Y. W., and D. M. Strong. "Knowing-Why about Data Processes and Data Quality," Journal of Management Information Systems (20:3), 2003, pp. 13-39.

Lukyanenko, R., and J. Parsons. "Information Loss in the Era of User-Generated Data," Pre-ICIS SIG IQ, 2011, pp. 1-6.

Lukyanenko, R., and J. Parsons. "Conceptual Modeling Principles for Crowdsourcing," Proceedings of the 1st International Workshop on Multimodal Crowd Sensing, 2012, pp. 3-6.

Lukyanenko, R., and J. Parsons. "Is Traditional Conceptual Modeling Becoming Obsolete?" Conceptual Modeling: ER 2013, 2013, pp. 1-14.

Lukyanenko, R., J. Parsons, and Y. Wiersma. "Citizen Science 2.0: Data Management Principles to Harness the Power of the Crowd," in Service-Oriented Perspectives in Design Science Research, H. Jain, A. Sinha, and P. Vitharana (eds.), Springer Berlin / Heidelberg, 2011, pp. 465-473.

Lukyanenko, R., J. Parsons, and Y. Wiersma. "The IQ of the Crowd: Understanding and Improving Information Quality in Structured User-Generated Content," Information Systems Research (in press), pp. 1-34.

Parsons, J., and R. Lukyanenko. "Reconceptualizing Data Quality as an Outcome of Conceptual Modeling Choices," 10th Symposium on Research in Systems Analysis and Design, 2011.

Parsons, J., R. Lukyanenko, and Y. Wiersma. "Easier Citizen Science is Better," Nature (471:7336), 2011, p. 37.

Parsons, J., R. Lukyanenko, and Y. Wiersma. "Understanding Information Quality in Crowdsourced Data," Winter Conference on Business Intelligence, 2014, pp. 1-3.

Parsons, J., and Y. Wand. "Choosing Classes in Conceptual Modeling," Communications of the ACM (40:6), 1997, pp. 63-69.

Parsons, J., and Y. Wand. "Emancipating Instances from the Tyranny of Classes in Information Modeling," ACM Transactions on Database Systems (25:2), 2000, pp. 228-268.

Parsons, J., and Y. Wand. "Using Cognitive Principles to Guide Classification in Information Systems Modeling," MIS Quarterly (32:4), 2008, pp. 839-868.

Redman, T. C. Data Quality for the Information Age, Artech House, Norwood, MA, 1996.

Silvertown, J. "Taxonomy: Include Social Networking," Nature (467:7317), 2010, p. 788.

Whitla, P. "Crowdsourcing and its Application in Marketing Activities," Contemporary Management Research (5:1), 2009, pp. 15-28.