Increasing quality of citizen science data

Oceans of Data MEOPAR Workshop, Oct 24, 2014, St. John's, NL, Canada
Roman Lukyanenko, Jeffrey Parsons, Yolanda Wiersma
Florida International University - Memorial University of Newfoundland

Notes
• This presentation is based on conceptual and empirical work done by the authors, including:
  - Conceptual: Parsons et al. 2011; Lukyanenko and Parsons 2012; Lukyanenko et al. 2011
  - Empirical: Lukyanenko et al. 2014a; Lukyanenko et al. 2014b

Quality of UGC

Research Problem
• A major challenge in making effective use of citizen data is crowd information quality (IQ)
  - E.g., accuracy of a citizen science observation on eBird.org

Gura 2013 Nature


Existing approaches
• Popular approaches to crowd IQ
  - Educate and train online users
  - Provide data collection instructions
  - "Clean" data post-hoc

• These approaches focus on data consumers (e.g., scientists) and may:
  - Dissuade contributors from providing data
  - Prevent contributors from communicating important local knowledge

Contributor-centric, use-agnostic IQ

• IQ from contributors' perspective
  - Crowd IQ: "the extent to which stored information represents the phenomena of interest to data consumers, as perceived by information contributors"
• Use-agnostic
• Contributor-centric
• Cognizant of data consumers

Illustration of the problem
[Images: Great Egret, Snowy Egret, White Ibis]
• Incorrect guess: ↓ accuracy
• Avoid participating: ↓ dataset completeness
• Any choice (incl. correct): ↓ instance completeness (attribute loss)

Experiment 1: Free form
• N=247 non-experts (141 female, 106 male)
  - 24 full-color images of plants and animals
  - Condition 1: Classify it: What is it?
  - Condition 2: Describe it using attributes / features
• Free-form responses

Results: H-1.1: Accuracy

Class type                            Total   Correct
Useful classes (e.g., great egret)      141   27 (19.15%)
Familiar classes (e.g., bird)          1550   1523 (98.26%)

[Figure: accuracy using classes useful to data consumers vs. classes familiar to crowd contributors; avg. p < 0.001*]
* Based on Fisher's exact test; significant with Bonferroni correction
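As a rough illustration of the kind of test named in the footnote, the sketch below runs a two-sided Fisher's exact test on the pooled counts reported on this slide. It is an assumption that pooling is meaningful here: the slide reports an average p-value with Bonferroni correction, which suggests the authors ran several separate tests, so this is not a reproduction of their analysis. The function name `fisher_exact_two_sided` is hypothetical and uses only the Python standard library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# Pooled counts from the slide (assumption: pooling across images)
useful_correct, useful_total = 27, 141
familiar_correct, familiar_total = 1523, 1550

p_value = fisher_exact_two_sided(
    useful_correct, useful_total - useful_correct,
    familiar_correct, familiar_total - familiar_correct,
)
print(f"useful:   {100 * useful_correct / useful_total:.2f}% correct")
print(f"familiar: {100 * familiar_correct / familiar_total:.2f}% correct")
print(f"two-sided p < 0.001: {p_value < 0.001}")
```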

Results: H-1.2: Information Loss

• Analysis of attributes:
  - 6,429 attributes are below the basic level
    • E.g., gray beak, deformed fin, looks sick
  - 685 attributes are at the basic level
    • E.g., can fly, has feathers

[Figure: instance completeness using classes useful to data consumers vs. classes familiar to crowd contributors; avg. p < 0.001*]
* Based on χ2 test; significant with Bonferroni correction

Experiment 3
• Impact of imposing structure on accuracy
  - Challenge to select predefined classes for crowds
• N=66 business students (non-experts)

"Useful" condition. What is it? Select one:
  o Arctic Tern
  o Bonaparte's Gull
  o Caspian Tern
  o …
  o Common Tern
  o I don't know
  o Other ___

"Familiar" condition. What is it? Select one:
  o Animal
  o Tern
  o …
  o Waterfowl
  o Bird
  o I don't know
  o Other ___

Free-form condition. What is it? Write one:

Items in bold are correct

Results: H-3.2: Accuracy

                 "Useful"   "Familiar"   Free-form
% Accuracy         35.5        66.7         77.3
% Basic-level       0.4        33.3         52.2

[Figure: accuracy of the class-based model with familiar options vs. free-form data collection; avg. p < 0.05*]

Accuracy does not necessarily increase when "familiar" options are included in a predefined schema

* Based on χ2 test

Findings from Lab. Studies
• Data quality dilemma of citizen science
  - Accuracy declines when classes are driven by data consumer needs
  - Accuracy increases when classes are familiar to contributors
  - But using such classes results in significant attribute loss
  - Potential for lower accuracy when using "familiar" classes

Principles of use-agnostic citizen science

• Allow contributor to define what data is relevant
• To guide input, capture data as individuals
  - Variable attributes
  - Variable classes
  - Free-form descriptions
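One way to picture these principles is a record that stores an observed individual with whatever attributes, classes, and free text the contributor chooses to give, and with no required species field. The sketch below is a minimal illustration under that reading, not the NLNature schema; the names `Observation` and `instances_with_class` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """One observed individual (instance); no field is mandatory."""
    description: str = ""                                  # free-form description
    attributes: list[str] = field(default_factory=list)   # variable attributes
    classes: list[str] = field(default_factory=list)      # variable classes

def instances_with_class(observations, label):
    """Return observations whose contributor supplied the given class label."""
    wanted = label.strip().lower()
    return [o for o in observations
            if any(c.strip().lower() == wanted for c in o.classes)]

# Hypothetical sightings: the second contributor gave attributes only
sightings = [
    Observation(
        description="Large white bird standing in shallow water",
        attributes=["white feathers", "long neck"],
        classes=["bird"],
    ),
    Observation(attributes=["gray beak", "looks sick"]),
]

print(len(instances_with_class(sightings, "Bird")))
```

Because classes are optional, the second sighting is still stored even though its contributor never named what the animal was; a data consumer can classify it later from the attributes.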


Demonstration of the principles
• NLNature: a real citizen science IS (www.nlnature.com)
[Figure labels: "Observe wildlife"; "Scientists then use data to:"]


Impact on dataset completeness / participation
• Field experiment using NLNature
  - Class-based condition (species-only)
  - Instance-based condition

• Hypotheses:
  - H-4.1 More instances observed in the instance-based condition
  - H-4.2 More instances of novel (i.e., not present in existing schema) species in the instance-based condition

Results: H-4.1
• Period: June to Dec (6 months)

Condition         No. of users in condition   No. of observations
Class-based                  42                        87
Instance-based               39                       390

[Figure: dataset completeness (instances stored), class-based model vs. instance-based model; avg. p < 0.01*]
* Based on permutation test
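The slide does not say how the permutation test was set up, so the following is only a generic sketch of a one-sided permutation test on per-user observation counts. The per-user numbers are HYPOTHETICAL (the deck reports totals only); they were chosen to loosely echo the reported averages of about 2 and 10 observations per user. The function name `permutation_test` is also an assumption.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean(group_b) larger than mean(group_a)?"""
    rng = random.Random(seed)
    observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign users to conditions
        diff = (sum(pooled[n_a:]) / (len(pooled) - n_a)
                - sum(pooled[:n_a]) / n_a)
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero
    return (at_least_as_large + 1) / (n_permutations + 1)

# HYPOTHETICAL per-user observation counts, not data from the study
class_based = [0, 1, 1, 2, 2, 3, 3, 4]
instance_based = [4, 6, 8, 9, 10, 12, 14, 17]

p_value = permutation_test(class_based, instance_based)
print(f"p < 0.01: {p_value < 0.01}")
```

Shuffling condition labels asks how often a random split of the same users would produce a gap at least as large as the observed one, which avoids assuming any particular distribution for the skewed per-user counts typical of crowd contributions.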

Results: H-4.2
• Period: June to Dec (6 months)

Condition         No. of users in condition   No. of new species
Class-based                  42                        7
Instance-based               39                      119

[Figure: dataset completeness (novel species stored), class-based model vs. instance-based model; avg. p < 0.01*]
* Based on permutation test
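Because the two conditions had slightly different numbers of users, it can help to normalize the reported totals per user. The short sketch below does that arithmetic on the counts from the H-4.1 and H-4.2 results; the per-user figures are a derived convenience, not numbers reported in the deck, and the helper name `per_user` is hypothetical.

```python
# Totals copied from the H-4.1 and H-4.2 results (June to Dec)
results = {
    "Class-based": {"users": 42, "observations": 87, "new_species": 7},
    "Instance-based": {"users": 39, "observations": 390, "new_species": 119},
}

def per_user(condition, measure):
    """Average of a reported total per user in the given condition."""
    row = results[condition]
    return row[measure] / row["users"]

for condition in results:
    print(
        condition,
        round(per_user(condition, "observations"), 2),
        round(per_user(condition, "new_species"), 2),
    )
```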

Summary
• Prevailing class-based approaches may
  - Lower data quality
  - Lower participation, and
  - Preclude discovery of new classes of instances

• Potential value of the proposed use-agnostic, instance-based citizen science
• "Easier citizen science is better"

References: Conceptual
• Parsons, J., Lukyanenko, R., and Wiersma, Y. (2011). Easier citizen science is better. Nature, 471 (7336), p. 37.
• Lukyanenko, R. and Parsons, J. (2012). Conceptual modeling principles for crowdsourcing. Conference on Information and Knowledge Management (CIKM) CrowdSense Workshop. Association for Computing Machinery (ACM) Press, New York, NY, pp. 3-6.
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2011). Citizen Science 2.0: Data Management Principles to Harness the Power of the Crowd. In H. Jain, A. Sinha & P. Vitharana (Eds.), International Conference on Design Science Research in Information Systems and Technologies (DESRIST 2011), Lecture Notes in Computer Science Vol. 6629, Springer Berlin / Heidelberg, pp. 465-473.

References: Empirical
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2014a). The IQ of the Crowd: Understanding and Improving Information Quality in Structured User-generated Content. Information Systems Research, 25 (4), pp. 669-689.
• Lukyanenko, R., Parsons, J., and Wiersma, Y. (2014b). The Impact of Conceptual Modeling on Dataset Completeness: A Field Experiment. International Conference on Information Systems, 19 pp.

THANK YOU!