Engineering considerations for large astrophysics projects
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut für Astronomie, Heidelberg, Germany
2014 January 9
punchlines
- Calibration programs are wasteful and reduce the accuracy of your end-of-mission results.
  - (you will need to adjust your observing strategy)
- Homogeneity and uniformity of survey samples are impossible, unnecessary, and harmful goals.
  - (you will need to implement some probability theory)
- Proper uncertainty propagation is not easy.
  - (I got nothing)
- The challenge is to make precise measurements and keep discovery space open.
  - (you will need to understand, quantitatively, your goals)
my teachers (incomplete list)
- Gerry Neugebauer (Caltech, emeritus)
- Sam Roweis (Toronto & NYU, deceased)
- Dave Schlegel (LBL) & Scott Burles (Cutler)
- Mike Blanton (NYU)
- Dustin Lang (CMU) & Jo Bovy (IAS) & Dan Foreman-Mackey (NYU)
survey-centric context
- Gaia
- SKA and pathfinders
- Euclid
- LSST
- SDSS-IV
- ... and many more
- (I am going to get mean at the end.)
my day job
- Astrometry.net and TheTractor
- emcee and kplr
- precision measurement, probabilistic inference
- data-driven models
homogeneity and uniformity are impossible
- weather
- target selection
- hardware evolution
- efficiency considerations
probabilistic target selection
- SDSS-III BOSS quasar target selection
- in SDSS bandpasses, z ∼ 3 quasars look like A-type stars
  - stars outnumber quasars enormously
  - don’t have good models of either
- Bovy et al. arXiv:1011.6392
- this target selection cannot be uniform
  - heterogeneous data quality means heterogeneous target selection
  - star density varies on the sky
  - suck it up!
homogeneity and uniformity are unnecessary
- correct the data
  - compute inverse selection “volume” or probabilities
  - 1/Vmax (ish)
  - re-weight the data using these inverse volumes
  - very wrong!
- forward modeling
  - write down uncensored p0(data | parameters)
  - multiply by (one minus) censoring rate η(data)
  - renormalize to get expected p(data | parameters)
  - this is a likelihood function
- (but: visualizing a forward model)
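
The forward-modeling recipe above can be sketched in a few lines. This is a toy under stated assumptions, not any survey's pipeline: p0 is taken to be a Gaussian with unknown mean and known width, and the selection function `keep_prob` (one minus the censoring rate η) is a made-up soft threshold. The constants `SIGMA`, `CUT`, and `WIDTH` are hypothetical. The point is only that renormalizing p0 × (1 − η) gives a likelihood whose maximum recovers the input mean, while the naive mean of the censored sample does not.

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 1.0            # assumed known width of the uncensored distribution p0
CUT, WIDTH = 0.5, 0.2  # hypothetical soft detection threshold

def keep_prob(x):
    """One minus the censoring rate: probability that datum x survives selection."""
    return 1.0 / (1.0 + np.exp(-(x - CUT) / WIDTH))

def p0(x, mu):
    """Uncensored Gaussian p0(x | mu)."""
    return np.exp(-0.5 * ((x - mu) / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))

def log_likelihood(mu, data, grid):
    """ln p(data | mu) for the censored sample."""
    # Renormalize: integrate p0 * (1 - eta) over all possible data values.
    norm = np.sum(p0(grid, mu) * keep_prob(grid)) * (grid[1] - grid[0])
    return np.sum(np.log(p0(data, mu) * keep_prob(data))) - data.size * np.log(norm)

# Simulate: draw from p0 with mu = 0, then censor with keep_prob.
true_mu = 0.0
draws = rng.normal(true_mu, SIGMA, 20000)
data = draws[rng.random(draws.size) < keep_prob(draws)]

# Evaluate the likelihood on a grid of trial means and take the maximum.
grid = np.linspace(-8.0, 8.0, 4001)
mu_trials = np.linspace(-1.0, 1.0, 201)
lnL = np.array([log_likelihood(mu, data, grid) for mu in mu_trials])
mu_hat = mu_trials[np.argmax(lnL)]

print("naive mean of censored sample:", data.mean())
print("maximum-likelihood mu:", mu_hat)
```

No data point is re-weighted here; the selection enters only through the normalization of the likelihood.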
estimators
- Cramér–Rao bound
  - example: Gaia astrometry
- likelihood principle(s)
  - it is our duty to analyze our very limited data with optimal methods
  - the output of any data analysis must be a likelihood function
  - WMAP, Planck
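
For reference, the Cramér–Rao bound named above, in its standard form for an unbiased estimator of a single parameter. The symbols θ (parameter), θ̂ (estimator), and I(θ) (Fisher information) are notation introduced here, not taken from the slides.

```latex
\mathcal{L}(\theta) \equiv p(\mathrm{data} \mid \theta), \qquad
I(\theta) = \mathrm{E}\!\left[\left(\frac{\partial \ln \mathcal{L}(\theta)}{\partial \theta}\right)^{2}\right], \qquad
\mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}
```

An estimator that reaches the bound uses all the information the likelihood function carries about θ, which is the sense of "optimal methods" here.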
likelihood principle
- I said “function”.
- p(data | parameters)
living the likelihood dream
- don’t make a catalog of objects
  - that’s some kind of (probably inefficient) estimator
  - even with error bars it can’t transmit the full information
- produce a likelihood function in catalog space
  - Lang et al. http://TheTractor.org/
  - Brewer et al. arXiv:1211.5805
homogeneity and uniformity are unnecessary?
- special case of two-point functions (and higher orders)
- currently an unsolved problem
- (but papers from Wandelt’s group)
homogeneity and uniformity are harmful
- can’t be uniform in everything
  - the uniformity you choose only helps one of your customers!
- uniform samples end up requiring a lot of time on the least useful objects
- reduces the heterogeneity that is essential to calibration
self-calibration
- final imaging calibration of SDSS
  - made no use at all of the calibration program data
  - Padmanabhan et al. arXiv:astro-ph/0703454
calibration programs are wasteful
- there are more photons in the science data
- therefore, the science data contain more information about calibration
- (exceptions abound)
- you must take your data with proper heterogeneity!
  - Kepler
  - tiling patterns
  - Holmes et al. arXiv:1203.6255
[Figure: panels A to D; axes Sky Position α (deg) and Sky Position β (deg).]
[Figures: panels (a) to (d), contour maps over Focal Plane Position x (deg) and Focal Plane Position y (deg), with a Residuals (%) scale.]
Self-calibration of imaging
- A good survey:
  - every star appears in many images
  - in different images, the star is in different places
  - every image contains many stars
  - Holmes et al. arXiv:1203.6255
- Kepler and Spitzer exoplanet photometry is pessimal for self-calibration...
  - ...but for a very good reason!
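
A minimal sketch of why the "good survey" conditions above matter, assuming a deliberately simple model: each measured magnitude is the star's true magnitude plus one zero point per image plus Gaussian noise. This is a toy linear least-squares fit, not the method of Padmanabhan et al. or Holmes et al. (which also fit things like flat-fields), and every number in it is invented. No calibration-program data appear anywhere; the overlaps alone determine the relative zero points.

```python
import numpy as np

rng = np.random.default_rng(42)
n_stars, n_images, noise = 50, 20, 0.01   # hypothetical survey size and noise (mag)

true_mag = rng.uniform(15.0, 20.0, n_stars)   # unknown star magnitudes
true_zp = rng.normal(0.0, 0.1, n_images)      # unknown per-image zero points

# Every star appears in many images, and every image contains many stars.
observed = rng.random((n_stars, n_images)) < 0.5
star_idx, image_idx = np.nonzero(observed)
n_obs = star_idx.size
measured = true_mag[star_idx] + true_zp[image_idx] + noise * rng.normal(size=n_obs)

# Linear model: measured = A @ [star magnitudes, image zero points]
A = np.zeros((n_obs, n_stars + n_images))
A[np.arange(n_obs), star_idx] = 1.0
A[np.arange(n_obs), n_stars + image_idx] = 1.0

# lstsq returns the minimum-norm solution of this rank-deficient system.
solution, *_ = np.linalg.lstsq(A, measured, rcond=None)
fit_zp = solution[n_stars:]

# The overall offset is degenerate, so compare zero points relative to their means.
zp_error = (fit_zp - fit_zp.mean()) - (true_zp - true_zp.mean())
print("worst relative zero-point error (mag):", np.abs(zp_error).max())
```

If each star were observed in only one image (no overlaps), the design matrix would no longer tie the images together and the relative zero points would be unconstrained.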
target selection is classification
- SDSS-III BOSS is taking spectra of quasars, not stars
- stars outnumber (relevant) quasars by factors of hundreds
- observations are noisy and theoretical models are incomplete
- want to find only the quasars... or do we?
classification algorithms
- Support Vector Machine, Random Forest, Artificial Neural Net
  - all bad!
- value of a causal model
  - training and test samples don’t match
  - need to classify new data taken under different conditions
  - make use of our technical knowledge about the data
  - Bovy et al. arXiv:1011.6392
[Figure: panels labeled 1-epoch, model, 30-epoch.]
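
One way to read "make use of our technical knowledge about the data" is sketched below: model the noise-free distribution of each class, convolve it with each object's own noise, and weight by the relative abundances. This is a one-dimensional toy, not the Bovy et al. method; the class means, widths, and priors in `CLASSES` are hypothetical, and only the 99:1 prior echoes the "factors of hundreds" quoted earlier.

```python
import numpy as np

# Hypothetical noise-free ("underlying") color distributions for each class.
CLASSES = {
    "quasar": {"mean": 0.0, "sigma": 0.2, "prior": 0.01},
    "star":   {"mean": 0.5, "sigma": 0.2, "prior": 0.99},
}

def gaussian(x, mean, var):
    """Normal density with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def quasar_probability(color, color_err):
    """Posterior P(quasar | observed color, that object's noise level)."""
    # Convolving a Gaussian class model with Gaussian noise adds the variances.
    weighted = {
        name: c["prior"] * gaussian(color, c["mean"], c["sigma"] ** 2 + color_err ** 2)
        for name, c in CLASSES.items()
    }
    return weighted["quasar"] / sum(weighted.values())

# Same observed color, very different data quality.
p_precise = quasar_probability(-0.3, 0.05)
p_noisy = quasar_probability(-0.3, 0.5)
print("well-measured object:", p_precise)
print("noisy object:", p_noisy)
```

The same observed color yields a high quasar probability when well measured and a low one when noisy, because the noise model is part of the classifier. A classifier trained only on observed colors at one noise level has no way to make that adjustment for data taken under different conditions.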
aside: discovery as classification
- found an exoplanet?
  - That’s a model selection move.
  - Bayes doesn’t tell you how to make decisions.
- utility arises
  - Make decisions that maximize expected (scientific?) return.
- Astrometry.net has an explicit utility model
  - Automatic calibration of an image successful?
  - Our “customer model” is that they are offended by false positives.
  - Lang et al. arXiv:0910.2233
utility considerations
- might be worth taking a source unlikely to be a quasar, as long as it is likely to be interesting
  - need to be able to make these trade-offs quantitatively
  - requires a specification of utility
  - needs to be measured in dollars (or equivalent)
  - long-term future discounted free cash flow
- the “game” of proposal writing
  - we aren’t honest in our proposals about what we want
  - SDSS was over-designed by any measure
  - that was valuable!
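
The trade-off described above (taking a source unlikely to be a quasar but likely to be interesting) becomes a one-line calculation once a utility is written down. A minimal sketch, assuming invented values: `value_quasar`, `value_interesting`, and `fiber_cost` are hypothetical numbers in arbitrary units standing in for dollars, and the candidate probabilities are made up.

```python
def expected_utility(p_quasar, p_interesting, value_quasar=1.0,
                     value_interesting=0.6, fiber_cost=0.3):
    """Expected net return of spending one fiber on a source.

    p_quasar:      probability that the source is a quasar
    p_interesting: probability that it is an interesting non-quasar
    """
    return p_quasar * value_quasar + p_interesting * value_interesting - fiber_cost

# Hypothetical candidates: (p_quasar, p_interesting)
candidates = {
    "likely quasar": (0.80, 0.05),
    "unlikely quasar, likely interesting": (0.10, 0.70),
    "probable ordinary star": (0.05, 0.02),
}

# Target a source only if its expected net return is positive.
selected = [name for name, (pq, pi) in candidates.items()
            if expected_utility(pq, pi) > 0.0]
print(selected)
```

With these numbers the second candidate is targeted despite its low quasar probability. Changing the assumed values changes the selection, which is the sense in which the utility has to be specified before the trade-off can be made.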
over-design
- SDSS was seriously over-designed to measure the large-scale structure
  - (no-one thinks that was a bad idea)
  - could have done all the large-scale structure in less than one year of observing
- we might have to be more honest going forward
- if we want to use resources efficiently, we need to face a trade-off between efficiency and discovery
  - At the present, everything is heuristics.
  - I say we make this trade-off explicitly, not implicitly.
utopia
- every part of your data analysis pipeline returns a likelihood function
  - likelihood is p(data | parameters)
  - information propagation through the pipeline always by likelihood function
  - implications are severe
- you can simulate data under different experimental designs
- you have a specified utility function
  - converts information in your answer into dollars
- every decision can now be an optimization
  - detectors, optical path, spectral elements
  - filters, exposure times, cadences
  - targets
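
A toy illustration of treating a design decision as an optimization, under loud assumptions. The "experiment" is a straight-line fit (think of a position drifting linearly in time), the two cadences are invented, and the utility is simply minus the variance of the fitted slope, standing in for a real dollar-valued utility. A Fisher-matrix forecast is used here in place of simulating full data sets under each design.

```python
import numpy as np

def slope_variance(times, sigma):
    """Cramér–Rao variance of the slope in a straight-line fit with Gaussian noise."""
    A = np.vander(times, 2)          # design matrix with columns [t, 1]
    fisher = A.T @ A / sigma ** 2    # Fisher information matrix for (slope, offset)
    return np.linalg.inv(fisher)[0, 0]

# Two hypothetical cadences with the same number of epochs and time span.
designs = {
    "uniform": np.linspace(0.0, 5.0, 20),
    "endpoints": np.concatenate([np.zeros(10), np.full(10, 5.0)]),
}

# Stand-in utility: smaller slope variance is better.
utility = {name: -slope_variance(t, sigma=0.1) for name, t in designs.items()}
best = max(utility, key=utility.get)
print(utility, best)
```

The "endpoints" cadence wins on this single number, but it gives up any ability to detect departures from a straight line, which is the efficiency-versus-discovery trade-off in miniature.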
example: bandpasses
- LSST plans to do imaging in ugrizy
- I am going to smash that r filter!
- why not do ugWizy?
  - easy example because zero-cost change
  - doesn’t require full utility specification
  - bet it is much better for low-s/n objects
hardware vs software trades
- P1640
  - Oppenheimer et al. arXiv:1303.2627
  - Fergus et al. in prep
- glitter cam
  - Fergus et al. MIT-CSAIL-TR-2006-058
open-source surveys
- Hipparcos example
- SDSS calibration example
- enormous benefits accrue from making the data re-reducible from scratch
throwing down the gauntlet
- Gaia uncertainty propagation (qualitative)
- Euclid observing strategy for imaging
- LSST bandpass, cadence, and exposure-time settings
- SKA pathfinder image products
- eBOSS two-point function estimators
- APOGEE & HERMES signal-to-noise requirements
- (My hourly rates are a bargain.)
- (These surveys are all awesome!)
punchlines
- Calibration programs are wasteful and reduce the accuracy of your end-of-mission results.
  - (you will need to adjust your observing strategy)
- Homogeneity and uniformity of survey samples are impossible, unnecessary, and harmful goals.
  - (you will need to implement some probability theory)
- Proper uncertainty propagation is not easy.
  - (I got nothing)
- The challenge is to make precise measurements and keep discovery space open.
  - (you will need to understand, quantitatively, your goals)