Increasing Quality of Citizen Science Data
Oceans of Data MEOPAR Workshop, Oct 24, 2014, St. John’s, NL, Canada
Roman Lukyanenko, Jeffrey Parsons, Yolanda Wiersma
Florida International University / Memorial University of Newfoundland
Notes
• This presentation is based on conceptual and empirical work done by the authors, including:
Conceptual: Parsons et al. 2011; Lukyanenko and Parsons 2012; Lukyanenko et al. 2011
Empirical: Lukyanenko et al. 2014a; Lukyanenko et al. 2014b
Research Problem
• A major challenge in making effective use of citizen data is crowd IQ, e.g., the accuracy of a citizen science observation on eBird.org (Gura 2013, Nature)
Existing approaches
• Popular approaches to improving crowd IQ:
– Educate and train online users
– Provide data collection instructions
– “Clean” data post hoc
• These approaches focus on data consumers (e.g., scientists) and may:
– Dissuade contributors from providing data
– Prevent contributors from communicating important local knowledge
Contributor-centric, use-agnostic IQ
• IQ from the contributors’ perspective
• Crowd IQ: “the extent to which stored information represents the phenomena of interest to data consumers, as perceived by information contributors”
• Use-agnostic
• Contributor-centric
• Cognizant of data consumers
Illustration of the problem
• A contributor observes a Great Egret but may only be able to guess among similar-looking classes (e.g., Snowy Egret, White Ibis)
• Incorrect guess: ↓ accuracy
• Avoiding participation: ↓ dataset completeness
• Any class choice (including the correct one): ↓ instance completeness (attribute loss)
Experiment 1: Free-form
• N = 247 non-experts (141 female, 106 male)
• 24 full-color images of plants and animals
• Condition 1: Classify it (“What is it?”)
• Condition 2: Describe it using attributes/features
• Free-form responses
Results: H-1.1: Accuracy
• Classes useful to data consumers (e.g., great egret): 27 of 141 responses correct (19.15%)
• Classes familiar to crowd contributors (e.g., bird): 1,523 of 1,550 responses correct (98.26%)
• p < 0.001 (Fisher’s exact test; significant with Bonferroni correction; see the sketch below)
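As a rough illustration (not part of the original slides), the reported result can be reproduced from the counts above, assuming the 2×2 contingency table of correct vs. incorrect responses per class type:

```python
# Sketch only: rebuild the 2x2 table implied by the slide's counts
# (correct vs. incorrect responses per class type) and run Fisher's exact test.
from scipy.stats import fisher_exact

table = [
    [27, 141 - 27],       # classes useful to data consumers: correct, incorrect
    [1523, 1550 - 1523],  # classes familiar to contributors:  correct, incorrect
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.4f}, p = {p_value:.3g}")  # p is far below 0.001
```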
Results: H-1.2: Information Loss
• Analysis of the attributes contributors provided:
– 6,429 attributes below the basic level (e.g., gray beak, deformed fin, looks sick)
– 685 attributes at the basic level (e.g., can fly, has feathers)
• Instance completeness: most of the information contributors provide sits below the basic level and is not captured by a familiar (basic-level) class alone
• p < 0.001 (χ² test; significant with Bonferroni correction; see the sketch below)
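The slide does not show the exact contingency table behind the reported χ² statistic; the sketch below assumes a simple goodness-of-fit test of the two attribute counts against an equal split, which may differ from the analysis actually used:

```python
# Hedged sketch: one-sample chi-square test on the attribute counts above,
# assuming equal expected frequencies (the study's exact test setup is not shown).
from scipy.stats import chisquare

counts = [6429, 685]  # attributes below the basic level, attributes at the basic level
chi2, p_value = chisquare(counts)  # default expected frequencies: equal split
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```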
Experiment 3
• Impact of imposing structure on accuracy: the challenge of selecting predefined classes for crowds
• N = 66 business students (non-experts), three conditions:
– “Useful” condition: What is it? Select one: Arctic Tern, Bonaparte's Gull, Caspian Tern, …, Common Tern, I don’t know, Other ___
– “Familiar” condition: What is it? Select one: Animal, Tern, …, Waterfowl, Bird, I don’t know, Other ___
– Free-form condition: What is it? Write one: ___
• Correct options were shown in bold on the original slide
Results: H-3.2: Accuracy

Condition        % Accuracy   % Basic-level responses
“Useful”            35.5           0.4
“Familiar”          66.7          33.3
Free-form           77.3          52.2

• Free-form data collection vs. class-based model with familiar options: p < 0.05 (χ² test; see the sketch below)
• Accuracy does not necessarily increase when “familiar” options are included in a predefined schema
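The slide reports only percentages, so the sketch below uses a placeholder response count per condition (n_per_condition is hypothetical, not a figure from the study) to show how a χ² test of homogeneity across the three conditions could be run:

```python
# Hypothetical sketch: chi-square test of accuracy across the three conditions.
# n_per_condition is a placeholder, NOT taken from the study.
from scipy.stats import chi2_contingency

n_per_condition = 500  # hypothetical number of responses per condition
accuracy = {"Useful": 0.355, "Familiar": 0.667, "Free-form": 0.773}

# Build a conditions x (correct, incorrect) contingency table.
table = [
    [round(p * n_per_condition), round((1 - p) * n_per_condition)]
    for p in accuracy.values()
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")
```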
Findings from Lab Studies
• The data quality dilemma of citizen science:
– Accuracy declines when classes are driven by data consumer needs
– Accuracy increases when classes are familiar to contributors
– But using such classes results in significant attribute loss
– There is also potential for lower accuracy when “familiar” classes are imposed in a predefined schema
Principles of use-agnostic citizen science
• Allow contributors to define what data are relevant
• To guide input, capture data about individual instances (see the sketch below):
– Variable attributes
– Variable classes
– Free-form descriptions
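As a minimal sketch of what these principles imply for a data schema (the class and attribute names are illustrative, not the actual NLNature implementation), an instance-based record keeps whatever classes and attributes a contributor reports instead of forcing a single predefined species field:

```python
# Illustrative sketch only: contrast a class-based record with an instance-based one.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ClassBasedObservation:
    species: str          # must be chosen from a predefined species list
    observed_at: datetime

@dataclass
class InstanceBasedObservation:
    observed_at: datetime
    classes: List[str] = field(default_factory=list)     # any classes, at any level ("bird", "egret", ...)
    attributes: List[str] = field(default_factory=list)  # free-form attributes ("white", "long beak", ...)
    description: str = ""                                 # free-form description

# A contributor unsure of the species can still record useful information:
obs = InstanceBasedObservation(
    observed_at=datetime(2014, 6, 15),
    classes=["bird"],
    attributes=["white", "long yellow beak", "wading in shallow water"],
    description="Large white bird near the shoreline",
)
```

Under this kind of schema, scientists can later map contributor-supplied attributes to the classes they need (e.g., inferring “great egret” from the attributes), rather than relying on contributors to pick the right species.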
Demonstration of the principles
• NLNature – a real citizen science IS (www.nlnature.com)
• Contributors observe wildlife
• Scientists then use the data to: …
Demonstration of the principles (continued)
Impact on dataset completeness / participation
• Field experiment using NLNature:
– Class-based condition (species-only)
– Instance-based condition
• Hypotheses:
– H-4.1: More instances observed in the instance-based condition
– H-4.2: More instances of novel species (i.e., not present in the existing schema) in the instance-based condition
Results: H-4.1
• Period: June to December (6 months)

Condition        No. of users   No. of observations
Class-based           42                87
Instance-based        39               390

• Dataset completeness (instances stored): instance-based model > class-based model, p < 0.01 (permutation test; see the sketch below)
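The unit of analysis for the reported permutation test is not shown on the slide; the sketch below assumes per-user observation counts and uses randomly generated placeholder data matching only the slide’s totals (42 users / 87 observations vs. 39 users / 390 observations):

```python
# Hedged sketch of a permutation test on per-user observation counts.
# The per-user counts are placeholders; only the group sizes and means echo the slide.
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(group_a, group_b, n_perm=10_000):
    """Two-sided permutation test on the difference in mean per-user counts."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = pooled[len(a):].mean() - pooled[:len(a)].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Placeholder per-user counts (hypothetical, for illustration only).
class_based = rng.poisson(87 / 42, size=42)
instance_based = rng.poisson(390 / 39, size=39)
diff, p = permutation_test(class_based, instance_based)
print(f"mean difference per user = {diff:.2f}, p = {p:.4f}")
```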
Results: H-4.2
• Period: June to December (6 months)

Condition        No. of users   No. of new species
Class-based           42                 7
Instance-based        39               119

• Dataset completeness (novel species stored): instance-based model > class-based model, p < 0.01 (permutation test)
Summary
• Prevailing class-based approaches may:
– Reduce data quality
– Reduce participation
– Preclude discovery of new classes of instances
• Potential value of the proposed use-agnostic, instance-based approach to citizen science
• “Easier citizen science is better” (Parsons et al. 2011)
References: Conceptual
• Parsons, J., Lukyanenko, R., and Wiersma, Y. (2011). Easier citizen science is better. Nature, 471(7336), p. 37.
• Lukyanenko, R. and Parsons, J. (2012). Conceptual modeling principles for crowdsourcing. Conference on Information and Knowledge Management (CIKM) CrowdSense Workshop. ACM Press, New York, NY, pp. 3-6.
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2011). Citizen Science 2.0: Data Management Principles to Harness the Power of the Crowd. In H. Jain, A. Sinha & P. Vitharana (Eds.), International Conference on Design Science Research in Information Systems and Technologies (DESRIST 2011), Lecture Notes in Computer Science Vol. 6629, Springer Berlin / Heidelberg, pp. 465-473.
References: Empirical
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2014a). The IQ of the Crowd: Understanding and Improving Information Quality in Structured User-generated Content. Information Systems Research, 25(4), pp. 669-689.
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2014b). The Impact of Conceptual Modeling on Dataset Completeness: A Field Experiment. International Conference on Information Systems, 19 pp.
THANK YOU!