Mining the Galaxy Zoo Database: Machine Learning Applications
Kirk D. Borne1, A. Vedachalam1, S. Baehr1, D. Sponseller1, J. Wallin2, C. Lintott3, D. Darg3, A. Smith3, L. Fortson4
1George Mason Univ., 2Middle Tennessee State University, 3Oxford University, 4University of Minnesota
http://mergers.galaxyzoo.org/ and http://www.galaxyzoo.org/ and http://zooniverse.org/
SUMMARY : The new Zooniverse initiative is addressing the data flood in the sciences through a transformative partnership between professional scientists, volunteer citizen scientists, and machines. As part of this project, we are exploring the application of machine learning techniques to data mining problems associated with the large and growing database of volunteer science results gathered by the Galaxy Zoo citizen science project. We describe the basic challenge, some machine learning approaches, and early results. One of the motivators for this study is the acquisition (through the Galaxy Zoo results database) of over 100 million classification labels for roughly one million galaxies, yielding a tremendously large and rich set of training examples for improving the automated galaxy morphological classification algorithms used in large sky survey data processing pipelines. In our first case study, the goal is to learn which morphological and photometric features in the SDSS (Sloan Digital Sky Survey) database correlate most strongly with user-selected merging and interacting galaxies. As a corollary to this study, we aim to identify which galaxy parameters in the SDSS science catalog best correspond to galaxies that have been the most difficult to classify (based upon large dispersion in their volunteer-provided classifications). The outcomes of this project will have applications in future large sky surveys, such as the LSST (Large Synoptic Survey Telescope) project, which will generate a catalog of 20 billion galaxies and will produce an additional astronomical alert database of approximately 100 thousand events each night for 10 years -- the capabilities and algorithms that we are exploring will assist in the rapid characterization and classification of such massive data streams. (This research has been supported in part through NSF award #0941610 and in part by NASA through the American Astronomical Society's Small Research Grant Program.)
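The "dispersion" in volunteer-provided classifications mentioned in the summary can be quantified in several ways. The following is a minimal sketch, assuming Shannon entropy of the vote fractions as the dispersion measure; the poster does not say which measure the authors use, and the function name and the vote lists are made up for illustration.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (in bits) of the class fractions in a list of votes.

    0.0 means every volunteer agreed; larger values mean more disagreement.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical votes for two galaxies (class names follow the Galaxy Zoo labels).
easy_galaxy = ["spiral"] * 10
hard_galaxy = ["spiral"] * 4 + ["elliptical"] * 4 + ["merger"] * 2

print(vote_entropy(easy_galaxy))  # unanimous votes give zero dispersion
print(vote_entropy(hard_galaxy))  # split votes give a higher value
```

Galaxies with high values of such a measure would be candidates for the "most difficult to classify" sample.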
Galaxy Zoo: http://www.galaxyzoo.org/
You can help us to classify a million galaxies!
“Welcome to Galaxy Zoo, the project which harnesses the power of the internet - and your brain - to classify a million galaxies. By taking part, you'll not only be contributing to scientific research, but you'll view parts of the Universe that literally no one has ever seen before and get a sense of the glorious diversity of galaxies that pepper the sky.”
“Why do we need you? – The simple answer is that the human brain is much better at recognizing patterns than a computer can ever be. Any computer program we write to sort our galaxies into categories would do a reasonable job, but it would also inevitably throw out the unusual, the weird and the wonderful. To rescue these interesting systems which have a story to tell, we need you.”

Zooniverse: http://zooniverse.org/
• NSF-funded CDI grant (PI: L. Fortson [U. Minnesota]; co-PIs J. Wallin [MTSU], K. Borne [GMU], C. Lintott [Oxford])
• Building a framework for new Citizen Science projects, including user-based research tools
• Science domains:
  – Astronomy (Galaxy Classification, Galaxy Mergers, Supernova Search)
  – The Moon (Lunar Reconnaissance Orbiter)
  – The Sun (STEREO dual spacecraft)
  – Egyptology (the Papyri Project)
  – and more (… accepting proposals from community)

Key Feature of Zooniverse: Data mining from the volunteer-contributed labels
• Train the automated pipeline classifiers with:
  – Improved classification algorithms
  – Better identification of anomalies
  – Fewer classification errors
Citizen Science
• Exploits the cognitive abilities of Human Computation!
• Novel mode of data collection:
  – Citizen Science! = Volunteer Science
  – e.g., VGI = Volunteer Geographic Information (Goodchild ’07)
  – e.g., Galaxy Zoo @ http://www.galaxyzoo.org/
• Citizen science refers to the involvement of volunteer non-professionals in the research enterprise. The experience must be engaging, must work with real scientific data/information, must address authentic science research questions that are beyond the capacity of science teams and enterprises, and must involve the scientists.
• GalaxyZoo helps scientists by engaging the public (>250,000) to classify millions of galaxies: ~150 million classifications so far = Spiral Galaxy or Elliptical Galaxy or Merging Galaxies

Examples of Volunteer (Citizen) Science
• Audubon Bird Counts
• Project Budburst
• Stardust@Home
• VGI (Volunteer Geographic Information)
• Galaxy Zoo (~30 refereed pubs so far…)
• Galaxy Mergers Zoo
• Zooniverse (buffet of Zoos)
• U-Science (semantic science 2.0) [ref: Borne 2009]
  – includes Biodas.org, Wikiproteins, HPKB, HydroTagger, AstroDAS
  – Ubiquitous, User-oriented, User-led, Untethered, You-centric Science

The Zooniverse: a Buffet of Zoos
http://zooniverse.org/
• Galaxy Zoo (released July 2007)
  – http://www.galaxyzoo.org/
  – Classify galaxies (Spiral, Elliptical, Merger, or image artifact)
• Galaxy Merger Zoo (released November 2009)
  – http://mergers.galaxyzoo.org/
  – Run N-body simulations to find best model to match a real merger
• The Hunt for Supernovae (released December 2009)
  – http://supernova.galaxyzoo.org/
  – Real-time event detection and classification
• Solar Storm Watch (released March 2010)
  – http://solarstormwatch.com/
  – Spot solar storms (CMEs) in near real-time
• Moon Zoo (released Summer 2010)
  – http://www.moonzoo.org/
  – Identify “interesting” features: cave lights, stone arches, boulder tracks

Tags produce a new data flood
• Tagging enables semantic data fusion & integration, and knowledge acquisition/representation/sharing.
• User-contributed content adds more data to the data flood.
• Example – Galaxy Zoo project:
  – ~250,000 participants (and growing)
  – ~1 million galaxies have been labeled (classified)
  – ~150 million classifications have been collected
• Tagging is applicable to any data source, including document repositories
  – adding lightweight semantics to the data repository (taxonomies, folksonomies, annotations)
Reference: “TagLearner: A P2P Classifier Learning System from Collaboratively Tagged Text Documents”, Dutta, Zhu, Mahule, Kargupta, Borne, Lauth, Holz, & Heyer, 2009 ICDM paper.

• Millions of training examples (V&V)
• Hundreds of millions of class labels
• Statistics deluxe! …
  – Users (see paper: http://arxiv.org/abs/0909.2925)
  – Uncertainty Quantification (UQ)
  – Classification certainty vs. Classification dispersion

Problem statement for Zooniverse Machine Learning (Data Mining) challenge problems
• Current data environment (today):
  – The Galaxy Zoo citizen science project has yielded ~150 million labels (classifications) for ~one million galaxies.
  – Current (distributed vertically partitioned) astronomical databases provide ~800 additional attributes for these galaxies.
• Future data environment (starting 2016):
  – The 10-year LSST sky survey science database will provide ~200 attributes for 50 billion objects, each measured 1000 times (=10 yrs).
  – The other astronomical databases provide ~800 attributes for about 100 million of those objects.
• Challenges (incl. Unsupervised, Semisupervised, and Supervised Learning):
  – Develop scalable algorithms that identify the most scientifically meaningful and statistically significant discoveries among the ~1000! (factorial!) combinations of attributes and user-provided labels, incl.:
    • Correlations, patterns, associations, outliers/novelties, “best” classification rules

First Case Study: test SDSS science catalog attributes to find which attributes correlate most strongly with user-classified mergers.
Attributes tested
Results of Decision Tree Information Gain analysis
Results of cluster separation analysis
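The first case study ranks SDSS catalog attributes by how strongly they relate to user-classified mergers, and the poster reports a decision tree information gain analysis. The following is a minimal sketch of that kind of ranking, assuming a single binary threshold split per attribute. It is not the authors' implementation: the attribute names are placeholders rather than real SDSS column names, and the values and labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the sample at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


def best_gain(values, labels):
    """Largest gain over thresholds placed midway between sorted distinct values."""
    distinct = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max((information_gain(values, labels, t) for t in thresholds), default=0.0)


# Toy data: volunteer-derived labels and two hypothetical catalog attributes.
labels = ["merger", "merger", "merger", "non-merger", "non-merger", "non-merger"]
catalog = {
    "attribute_a": [0.9, 0.8, 0.7, 0.2, 0.3, 0.1],  # separates the two classes
    "attribute_b": [0.5, 0.1, 0.9, 0.4, 0.8, 0.2],  # mixes the two classes
}

# Rank attributes from most to least informative about the merger label.
ranking = sorted(catalog, key=lambda name: best_gain(catalog[name], labels), reverse=True)
print(ranking)
```

On a real catalog with hundreds of attributes, the same ranking would normally be done with a library decision tree or a vectorized implementation rather than pure-Python loops.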
LSST in time and space:
– When? 2018-2028
– Where? Cerro Pachon, Chile

LSST Key Science Drivers: Mapping the Universe
– Solar System Map (moving objects, NEOs, asteroids: census & tracking)
– Nature of Dark Energy (distant supernovae, weak lensing, cosmology)
– Optical transients (of all kinds, with alert notifications within 60 seconds)
– Galactic Structure (proper motions, stellar populations, star streams)

Camera Specs: (pending funding from the DOE)
201 CCDs @ 4096x4096 pixels each! = 3 Gigapixels = 6 GB per image, covering 10 square degrees, every 20 secs = ~3000 times the area of one Hubble Telescope image

LSST Data Mining and Machine Learning Challenges:
• Massive data stream: ~2 Terabytes of image data per hour to be mined in real time (growing to ~100 Petabytes in 10 years)
• Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
• Massive event stream: knowledge extraction in real time for 100,000 events each night.

Related References
• Borne (2009): “U-Science”, http://essi.gsfc.nasa.gov/pdf/Borne2.pdf
• Borne, Jacoby, …, Wallin (2009): “The Revolution in Astronomy Education: Data Science for the Masses”, http://arxiv.org/abs/0909.3895
• Borne (2009): “Astroinformatics: A 21st Century Approach to Astronomy”, http://arxiv.org/abs/0909.3892
• Dutta, Zhu, Mahule, Kargupta, Borne, Lauth, Holz, & Heyer (2009): “TagLearner: A P2P Classifier Learning System from Collaboratively Tagged Text Documents”, accepted paper for ICDM-2009.
• Goodchild (2007): “Citizens as Sensors: the World of Volunteered Geography”, GeoJournal, 69, pp. 211-221.
• Lintott et al. (2009): “Galaxy Zoo: 'Hanny's Voorwerp', a quasar light echo?”, http://arxiv.org/abs/0906.5304
• Raddick et al. (2009): “Galaxy Zoo: Exploring the Motivations of Citizen Science Volunteers”, http://arxiv.org/abs/0909.2925
• Raddick, Bracey, Carney, Gyuk, Borne, Wallin, & Jacoby (2009): “Citizen Science: Status and Research Directions for the Coming Decade”, http://www8.nationalacademies.org/astro2010/DetailFileDisplay.aspx?id=454