An Information Modeling Approach to Improve Quality of User-generated Content
Roman Lukyanenko
Faculty of Business Administration, Memorial University of Newfoundland
August 8, 2014
WWW.BUSINESS.MUN.CA
Outline
• Background and motivation
• Research problem
  - Information quality (IQ) in user-generated content (UGC)
  - Limitations of existing approaches
• Proposed approach
  - Contributor-centric, use-agnostic IQ
  - Conceptual modeling as a factor of crowd IQ
• Theoretical propositions
• Empirical evidence
  - Impact of conceptual modeling on accuracy and information loss
• Principles for modeling UGC
  - Demonstration of the proposed principles
  - Impact of conceptual modeling on dataset completeness
• Contributions and future research
Quality of UGC
Background and Motivation
• Traditionally, information systems (IS) are used in well-controlled information production settings
• Increasingly, organizations turn to data produced outside organizational boundaries
  - Social media and crowdsourcing facilitate user-generated content (UGC)
  - UGC: various forms of digital information produced by members of the general public (often casual content contributors, "the crowd") rather than by employees or others closely associated with an organization
Harnessing UGC
• UGC supports decision making and operations
  - Businesses: better understand customers, develop products
  - Health care: telemedicine, doctor reviews
  - Government: public services, disaster management
  - Scientific research: citizen science
Example: Citizen Science
eBird.org
Research Problem: IQ in UGC
• A major challenge in making effective use of UGC is crowd IQ
  - E.g., the accuracy of a citizen science observation on eBird.org (Gura 2013, Nature)
Limitations of existing approaches
• The traditional approach to IQ is ‘fitness for use’
• Popular approaches to crowd IQ
  - Educate and train online users
  - Provide data collection instructions
  - “Clean” data post hoc
• These approaches focus on data consumers and may
  - Dissuade contributors from providing data
  - Prevent contributors from communicating important situational knowledge
Proposed: Contributor-centric, use-agnostic IQ
• IQ from contributors’ perspective
  - Crowd IQ: “the extent to which stored information represents the phenomena of interest to data consumers, as perceived by information contributors”
  - Use-agnostic, contributor-centric
  - Cognizant of data consumers
• How to design IS sensitive to contributors?
  - Rethink approaches to conceptual modeling
Proposed: Conceptual modeling as a factor of crowd IQ
• Conceptual modeling: “describing some aspects of the physical and social world around us for the purposes of understanding and communication” (Mylopoulos 1992)
Research questions
• Research question 1: How does conceptual modeling affect IQ in UGC settings?
• Research question 2: What conceptual modeling principles can be developed to improve crowd IQ?
Connection between modeling and IQ
• Traditionally: modeling facilitates intended uses via predefined abstractions (e.g., classes)
• Ontology, cognition
  - The world is made of unique individuals: class-based models capture common rather than unique attributes of individuals
  - Classes are observer-dependent and use-driven: crowd contributors and data consumers may not share classes in a domain
Illustration of theoretical propositions
[Figure: an observed bird with candidate class labels: Great Egret, Snowy Egret, White Ibis, Bird, Fish, Tree]
• P1: Accuracy and completeness are undermined when classes are unfamiliar to contributors
  - Incorrect guess → ↓ accuracy
  - Avoid participating → ↓ dataset completeness
• P2: Information loss increases when classes are familiar to contributors
  - Any choice (incl. correct) → ↓ instance completeness (attribute loss)
Impact on Accuracy and Information Loss
• Three laboratory experiments
  - Participants: potential data contributors, biology non-experts
• How to determine “familiar” classes for anonymous crowds?
  - Psychology: basic-level categories
• Two class-based models:
  - “Useful” (biological species)
  - “Familiar” (basic-level categories)
Experiment 1: Free form
• N=247 non-experts (141 female, 106 male)
• 24 full-color images of plants and animals
  - Condition 1: Classify it: What is it?
  - Condition 2: Describe it using attributes / features
• Free-form responses
Experiment 1: Hypotheses
[Diagram: classes useful to data consumers vs. classes familiar to data contributors, related to accuracy and instance completeness]
• H-1.1: Accuracy. Non-experts will classify instances with fewer errors at the basic level than at the species-genus level
• H-1.2: Instance Completeness. Non-experts will describe instances in terms of attributes subordinate to the basic level
  - E.g., grey beak, yellow belly vs. can fly, has feathers
Results: H-1.1: Accuracy
• Useful classes (e.g., great egret): 141 total, 27 (19.15%) correct
• Familiar classes (e.g., bird): 1550 total, 1523 (98.26%) correct
• avg. p < 0.001*
[Diagram: classes useful to data consumers vs. classes familiar to crowd contributors, compared on accuracy]
* Based on Fisher’s exact test; significant with Bonferroni correction
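The counts reported for H-1.1 can be checked with a two-sided Fisher's exact test. The sketch below is a minimal pure-Python version and rests on assumptions: it pools all responses into a single 2×2 table, whereas the slide reports an averaged p-value (presumably over separate tests), and the function name `fisher_exact_two_sided` and the Bonferroni divisor of 24 (one test per image) are illustrative choices, not details taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability that row 1 holds k of the col1 "successes"
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Pooled counts from the slide (pooling is an assumption of this sketch)
useful_correct, useful_total = 27, 141
familiar_correct, familiar_total = 1523, 1550

p_value = fisher_exact_two_sided(
    useful_correct, useful_total - useful_correct,
    familiar_correct, familiar_total - familiar_correct,
)

# Assumed Bonferroni threshold: one test per image, 24 images
bonferroni_alpha = 0.05 / 24

print(f"useful:   {useful_correct / useful_total:.2%} correct")
print(f"familiar: {familiar_correct / familiar_total:.2%} correct")
print("significant after Bonferroni:", p_value < bonferroni_alpha)
```

Exact integer arithmetic via `math.comb` avoids the overflow that factorial-based formulas would hit with counts of this size.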
Results: H-1.2: Instance Completeness
• Analysis of attributes:
  - 6,429 attributes are below the basic level (e.g., gray beak, deformed fin, looks sick)
  - 685 attributes are at the basic level (e.g., can fly, has feathers)
• avg. p < 0.001*
[Diagram: classes useful to data consumers vs. classes familiar to crowd contributors, compared on instance completeness]
* Based on χ2 test; significant with Bonferroni correction
Experiment 2: Fixed-choice
• Direct test of data entry with predefined classes
• N=77 non-experts
• Task: select class from predefined list
Experiment 2: Materials
“Useful” Condition
What is it? Select one:
o Arctic Tern
o Bonaparte's Gull
o Caspian Tern
o Common Tern
o Herring Gull
o Iceland Gull
o Killdeer
o Parasitic jaeger
o Pomarine jaeger
o I don’t know
o Other ___
“Familiar” Condition (classes informed by cognitive psychology)
What is it? Select one:
o Animal
o Common Tern
o Iceland Gull
o Killdeer
o Seagull
o Shorebird
o Tern
o Waterfowl
o Bird
o I don’t know
o Other ___
Items in bold are correct
Results: H-2.1: Accuracy
• Useful classes (e.g., great egret): 271 total, 73 (24.84%) correct
• Familiar classes (e.g., bird): 375 total, 277 (73.88%) correct
• avg. p < 0.01*
[Diagram: classes useful to data consumers vs. classes familiar to crowd contributors, compared on accuracy]
* Based on χ2 test; significant with Bonferroni correction
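A χ2 test of independence on the Experiment 2 counts can be sketched as follows. This is an illustration under assumptions, not the study's analysis: it pools the counts into one 2×2 table (the slide reports an averaged p-value, so pooled figures need not reproduce the percentages shown), it applies no continuity correction, and the helper name `chi_square_2x2` is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2))
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Pooled counts from the slide: (correct, incorrect) per condition
useful = (73, 271 - 73)
familiar = (277, 375 - 277)

stat, p_value = chi_square_2x2(*useful, *familiar)
print(f"chi-square = {stat:.2f}, p = {p_value:.3g}")
```

The closed form for the p-value holds only for a 2×2 table (one degree of freedom); larger tables would need the general χ2 survival function.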
Experiment 3
• Impact of imposing structure on accuracy
  - It is a challenge to select predefined classes for crowds
• N=66 business students (non-experts)
• Three conditions:
  - “Useful”: What is it? Select one: Arctic Tern, Bonaparte's Gull, Caspian Tern, …, Common Tern, I don’t know, Other ___
  - “Familiar”: What is it? Select one: Animal, Tern, …, Waterfowl, Bird, I don’t know, Other ___
  - Free-form: What is it? Write one:
Items in bold are correct
Results: H-3.2: Accuracy
Measure          “Useful”   “Familiar”   Free-form
% Accuracy       35.5       66.7         77.3
% Basic-level    0.4        33.3         52.2
[Diagram: class-based model with familiar options vs. free-form data collection, compared on accuracy; avg. p < 0.05*]
• Accuracy does not necessarily increase when “familiar” options are included in a predefined schema
* Based on χ2 test
Findings from Lab. Studies
• Conceptual modeling is an important factor for crowd IQ
• Dilemma of modeling in UGC:
  - Accuracy declines when classes are driven by data consumer needs
  - Accuracy increases when classes are familiar to contributors
  - But using such classes undermines instance completeness (i.e., results in significant attribute loss)
  - Predefined “familiar” classes may still yield lower accuracy than free-form data collection
Principles of modeling UGC
• Modeling UGC should be based on user- and use-invariant representations
• Instances should be the primary construct in UGC
• Attributes can be attached to an instance
• Classes can be attached to an instance
[Diagram: Instances, Attributes, and Classes linked by associations with multiplicities 0…* and 1…*]
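The principles above can be sketched as a small data structure in which the instance is the primary record and both attributes and classes are optional, free-form annotations. This is a minimal illustration under assumptions: the names `Instance` and `InstanceStore` and their methods are hypothetical and do not describe how NLNature or any other system is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One observed individual: the primary construct (hypothetical sketch)."""
    instance_id: int
    attributes: set[str] = field(default_factory=set)  # e.g. "grey beak"
    classes: set[str] = field(default_factory=set)     # e.g. "bird"


class InstanceStore:
    """Stores instances without a predefined list of classes or attributes."""

    def __init__(self) -> None:
        self._instances: list[Instance] = []

    def record(self, attributes=(), classes=()) -> Instance:
        # Nothing is validated against a schema: contributors report what they see.
        inst = Instance(
            instance_id=len(self._instances) + 1,
            attributes={a.strip().lower() for a in attributes},
            classes={c.strip().lower() for c in classes},
        )
        self._instances.append(inst)
        return inst

    def with_class(self, name: str) -> list[Instance]:
        return [i for i in self._instances if name.lower() in i.classes]

    def with_attribute(self, name: str) -> list[Instance]:
        return [i for i in self._instances if name.lower() in i.attributes]


store = InstanceStore()
store.record(attributes=["grey beak", "yellow belly"], classes=["bird"])
store.record(attributes=["white", "long legs"])  # no class offered at all
store.record(attributes=["Grey beak"], classes=["tern", "bird"])

print(len(store.with_class("bird")))           # instances labelled "bird"
print(len(store.with_attribute("grey beak")))  # instances sharing an attribute
```

Because no class is required, a contributor who cannot name a species can still record the sighting through attributes, and data consumers can later group instances by whichever classes or attributes suit their use.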
Demonstration of the principles
• NLNature (www.nlnature.com): a real citizen science IS
[Figure: NLNature workflow; surviving labels: “Observe wildlife”, “Scientists then use data to:”]
Impact on dataset completeness
• Field experiment using NLNature
  - Class-based condition (species-only)
  - Instance-based condition
• Hypotheses:
  - H-4.1: More instances observed in the instance-based condition
  - H-4.2: More instances of novel (i.e., not present in the existing schema) species in the instance-based condition
Instance-based condition
Species-only condition
Results: H-4.1
• Period: June to Dec (6 months)
Condition        No. of users in condition   No. of observations
Class-based      42                          87
Instance-based   39                          390
[Diagram: class-based model vs. instance-based model, compared on dataset completeness (instances stored); avg. p < 0.01*]
* Based on permutation test
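The totals above work out to roughly 87/42 ≈ 2.1 observations per user in the class-based condition and 390/39 = 10 in the instance-based condition. A permutation test of such a difference can be sketched as below. The per-user counts are not given in the slides, so the two lists are invented purely for illustration, and the function name `permutation_test`, the one-sided formulation, and the number of permutations are all assumptions of this sketch.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean(group_b) larger than mean(group_a)?"""
    rng = random.Random(seed)
    observed = sum(group_b) / len(group_b) - sum(group_a) / len(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Reassign users to conditions at random and recompute the difference
        rng.shuffle(pooled)
        diff = sum(pooled[n_a:]) / n_b - sum(pooled[:n_a]) / n_a
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (at_least_as_large + 1) / (n_permutations + 1)


# INVENTED per-user observation counts, NOT the study's data
class_based_users = [0, 1, 2, 3, 1, 4, 2, 2]
instance_based_users = [5, 12, 8, 9, 15, 7, 11, 10]

p_value = permutation_test(class_based_users, instance_based_users)
print(f"p = {p_value:.4f}")
```

A permutation test makes no distributional assumption about per-user counts, which is useful when a few prolific contributors account for most observations.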
Results: H-4.2
• Period: June to Dec (6 months)
Condition        No. of users in condition   No. of new species
Class-based      42                          7
Instance-based   39                          119
[Diagram: class-based model vs. instance-based model, compared on dataset completeness (novel species stored); avg. p < 0.01*]
* Based on permutation test
Findings from Field Exp.
• Conceptual modeling affects dataset completeness
  - Prevailing class-based approaches may result in lower dataset completeness
  - Existing IS may preclude discovery of new classes of instances
• Potential value of the proposed instance-based approach for modeling UGC
Contributions
• Impact of conceptual modeling on information quality
  - Prevailing class-based modeling may have a detrimental impact on IQ (Lukyanenko et al. 2014)
  - Antecedents of IQ: “a significant gap in the IS research” (Petter et al. 2013, p. 30)
• Contributor-oriented IQ
• Instance-based conceptual modeling
  - More effective ways to harness UGC
  - Exemplar of an “[e]xciting ..work” exploring “new technological environments” (Goes 2014, MISQ editorial, p. vi)
Future work
• Deeper understanding of the impact of modeling on IQ:
  - Information loss
  - Interaction between classification and familiarity
• Contributor-oriented IQ management
  - Impact on decision making (in progress: study with data consumers)
• Beyond citizen science
  - Corporate settings
  - Health IS, telemedicine
Future work (cont’d)
• Extending the instance-based approach to conceptual modeling
  - How to combine it with traditional modeling?
  - Do we need “instance-based” grammars? (Lukyanenko and Parsons 2013a)
  - How to better manage attribute-based data collection
  - Implications for user interfaces (Lukyanenko and Parsons 2013b)
References
Goes, P. B. (2014). Editor's comments: Design science research in top information systems journals. MIS Quarterly, 38(1), iii-viii.
Gura, T. (2013). Citizen science: Amateur experts. Nature, 496(7444), 259-261.
Lukyanenko, R., & Parsons, J. (2013a). Is traditional conceptual modeling becoming obsolete? In W. Ng, V. C. Storey, & J. Trujillo (Eds.), International Conference on Conceptual Modeling (ER 2013), Lecture Notes in Computer Science, Vol. 8217, Springer, Heidelberg, pp. 61-73.
Lukyanenko, R., & Parsons, J. (2013b). Reconciling theories with design choices in design science research. In J. vom Brocke et al. (Eds.), International Conference on Design Science Research in Information Systems and Technologies (DESRIST 2013), Lecture Notes in Computer Science, Vol. 7939, Springer, Berlin / Heidelberg, pp. 165-180.
Lukyanenko, R., Parsons, J., & Wiersma, Y. F. (2014). The IQ of the crowd: Understanding and improving information quality in structured user-generated content. Information Systems Research, 25(4), 669-689.
Petter, S., DeLone, W., & McLean, E. (2013). Information systems success: The quest for the independent variables. Journal of Management Information Systems, 29(4), 7-62.
THANK YOU!