Astronomical Image Processing with Hadoop
Keith Wiley* - Survey Science Group, Dept. of Astronomy, Univ. of Washington
* Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Gardner, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus
1
Acknowledgments
This work is supported by the:
› NSF Cluster Exploratory (CluE) grant IIS-0844580.
› NASA grant 08-AISR08-0081.
2
Session Agenda
Astronomical Survey Science
Image Coaddition
Implementing Coaddition within MapReduce
Optimizing the Coaddition Process
Conclusions
Future Work
3
Astronomical Topics of Study
Dark energy
Large-scale structure of the universe
Gravitational lensing
Asteroid detection/tracking
4
What is Astronomical Survey Science?
› Dedicated sky surveys, usually from a single calibrated telescope/camera pair.
› Run for years at a time.
› Gather millions of images and TBs of storage*.
› Require high-throughput data reduction pipelines.
› Require sophisticated off-line data analysis tools.
* Next-generation surveys will gather PBs of image data.
5
Sky Surveys: Today and Tomorrow
SDSS* (1999-2005):
› Founded in part by UW.
› 1/4 of the sky.
› 80TB total data.
LSST† (2015-2025):
› 8.4m mirror, 3.2-gigapixel camera.
› Half the sky every three nights.
› 30TB per night...
› ...one SDSS every three nights.
› 60PB total (nonstop for ten years).
› 1000s of exposures of each location.
* Sloan Digital Sky Survey  † Large Synoptic Survey Telescope
6
(Photo caption: That's a person!)
FITS (Flexible Image Transport System)
Common astronomical image file format.
Metadata tags (like EXIF):
› Most importantly: precise astrometry* (* position on sky).
› Other:
  • Geolocation (telescope location).
  • Sky conditions, image quality, etc.
Bottom line:
› An image format that knows where it is looking.
7
Image Coaddition
Given multiple partially overlapping images and a query (color and sky bounds):
› Find the images' intersections with the query bounds.
› Project the bitmaps to the query bounds (the expensive step).
› Stack and mosaic into a final product (cheap by comparison).
9
Image Stacking (Signal Averaging)
Stacking improves SNR:
› Makes fainter objects visible.
Example (SDSS, Stripe 82):
› Top: single image, R-band.
› Bottom: 79-deep stack.
  • ~9x SNR improvement.
  • Numerous additional detections.
Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.
10
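The SNR claim above can be sanity-checked numerically. A minimal sketch in plain Python with synthetic data (not the SDSS pipeline): averaging N aligned exposures shrinks the per-pixel noise by roughly √N, so a 79-deep stack yields about an 8.9x SNR improvement.

```python
import random
import statistics

def stack(exposures):
    """Coadd aligned exposures by a simple per-pixel mean (the 'mere sum'
    baseline; real pipelines must also handle variable PSF, haze, etc.)."""
    n = len(exposures)
    return [sum(px) / n for px in zip(*exposures)]

random.seed(42)
truth = [5.0] * 4096                                   # a faint, flat source
exposures = [[t + random.gauss(0.0, 10.0) for t in truth] for _ in range(79)]

single_noise = statistics.pstdev(x - t for x, t in zip(exposures[0], truth))
stack_noise = statistics.pstdev(x - t for x, t in zip(stack(exposures), truth))
print(single_noise / stack_noise)   # roughly sqrt(79) ~ 8.9
```

The simple mean is only the starting point; the slide's caveat about variable conditions is why production coadds weight each exposure by its quality.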
Existing Image Coaddition Systems
SWarp:
› Multi-threaded parallelism (single machine only).
SDSS coadds of Stripe 82 (Fermilab):
› Same dataset used in our work.
› One-off project, not a general-purpose tool.
Montage (run on TeraGrid):
› Most similar to our work.
› MPI (complicated), TeraGrid (dedicated, expensive).
MapReduce (our work, this talk):
› High-level, potentially simpler to program.
› Scalable on cheap commodity hardware (the cloud).
11
Advantages of MapReduce (Hadoop)
› High-level problem description. No effort spent on internode communication, message passing, etc.
› Programmed in Java (accessible to most science researchers, not just computer scientists and engineers).
› Runs on cheap commodity hardware, potentially in the cloud, e.g., Amazon's EC2.
› Scalable: 1000s of nodes can be added to the cluster with no modification to the researcher's software.
› Large community of users/support.
12
Hadoop
A massively parallel database-processing system:
› In one sense: a parallel computing system (a cluster).
› In another sense: a parallel database.
› It's both!
Data flow (diagram):
› Data is stored on HDFS (the Hadoop Distributed File System): 100s of computers in a cluster.
› Data is processed by numerous parallel mappers (programs on the HDFS computers).
› Data is further processed by numerous parallel reducers (programs on the HDFS computers).
› Final output written: Hadoop job done, data fully processed.
13
Coaddition in Hadoop (diagram)
Mappers (parallel by image):
› Input: FITS images from HDFS.
› Detect each image's intersection with the query bounds.
› Project its bitmap to the query's coordinate system.
› Output: projected intersections.
Reducer (parallel by query):
› Stack and mosaic the projected intersections.
› Output: the final coadd.
14
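The mapper/reducer split above can be sketched in plain Python. This is a toy 1-D "sky" with constant-flux images, not Hadoop's API; the names and data layout are illustrative only.

```python
# Toy 1-D sky: each "image" covers [lo, hi) with a constant flux value.
def mapper(image, query):
    """Emit this image's projected intersection with the query bounds, if any."""
    lo = max(image["lo"], query["lo"])
    hi = min(image["hi"], query["hi"])
    if lo < hi:                       # overlaps the query: worth projecting
        yield ("coadd", (lo, hi, image["flux"]))

def reducer(values, query):
    """Stack and mosaic: average all contributions at each sky position."""
    width = query["hi"] - query["lo"]
    total, count = [0.0] * width, [0] * width
    for lo, hi, flux in values:
        for x in range(lo, hi):
            i = x - query["lo"]
            total[i] += flux
            count[i] += 1
    return [t / c if c else None for t, c in zip(total, count)]

query = {"lo": 10, "hi": 14}
images = [{"lo": 0, "hi": 12, "flux": 2.0},
          {"lo": 11, "hi": 20, "flux": 4.0},
          {"lo": 30, "hi": 40, "flux": 9.0}]   # discarded: no overlap

emitted = [v for img in images for _, v in mapper(img, query)]
print(reducer(emitted, query))   # -> [2.0, 3.0, 4.0, 4.0]
```

In the real system the mapper's projection is a 2-D resampling into the query's coordinate system, and a single reducer per query performs the stack.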
Driver Prefiltering
To assist the process, we prefilter the FITS files in the driver.
SDSS camera has 30 CCDs:
› 5 colors.
› 6 abutting strips of sky.
Prefilter (path glob) by color and sky coverage (single axis):
› Excludes many irrelevant FITS files.
› The sky-coverage filter is only single-axis:
  • Thus, false positives slip through...
  • ...to be discarded in the mappers.
(Diagram legend: query bounds, relevant FITS, prefilter-excluded FITS, false positives.)
15
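The driver-side glob prefilter can be sketched as follows, assuming a hypothetical SDSS-like naming convention (`fpC-<run>-<color><strip>-<field>.fit`; the real scheme may differ):

```python
import fnmatch

def prefilter(paths, color, strips):
    """Driver-side path-glob prefilter (hypothetical naming scheme):
    keep only files matching the query's color and candidate sky strips.
    It is a single-axis filter: files that pass may still miss the query
    bounds and get discarded later, inside the mappers."""
    patterns = [f"fpC-*-{color}{s}-*.fit" for s in strips]
    return [p for p in paths if any(fnmatch.fnmatch(p, pat) for pat in patterns)]

paths = ["fpC-001234-r3-0051.fit", "fpC-001234-g3-0051.fit",
         "fpC-001234-r5-0051.fit", "fpC-001234-r3-0052.fit"]
print(prefilter(paths, color="r", strips=[3]))
# -> ['fpC-001234-r3-0051.fit', 'fpC-001234-r3-0052.fit']
```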
Performance Analysis
Running time:
› 2 query sizes: 1° sq (3885 FITS) and ¼° sq (465 FITS).
› Run against 1/10th of SDSS (100,058 FITS).
(Chart: running time in minutes, 0-45, per query sky bounds extent. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Considering the small dataset, this is too slow!
› Remember 42 minutes for the next slide.
16
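The outlier removal mentioned in the chart footnote can be sketched as a single pass of Chauvenet's criterion: reject any sample whose expected count of equally extreme values, under a normal fit to the data, falls below 1/2. The timing values here are made up for illustration.

```python
import math
import statistics

def chauvenet(samples):
    """One pass of Chauvenet's criterion over a list of measurements."""
    n = len(samples)
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    keep = []
    for x in samples:
        # P(|X - mu| >= |x - mu|) for a normal distribution
        p = math.erfc(abs(x - mu) / (sigma * math.sqrt(2)))
        if n * p >= 0.5:              # expected count of values this extreme
            keep.append(x)
    return keep

runs = [21.8, 22.1, 21.9, 22.3, 22.0, 41.5]   # one anomalous timing, in minutes
print(chauvenet(runs))   # the 41.5-minute outlier is rejected
```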
Performance Analysis
Breakdown of large query running time:
(Chart: minutes, 0-40, per Hadoop stage: deglob input paths, construct file splits, runJob(), mapper done, reducer done, main().)
› main() is the sum of: Driver + runJob().
› runJob() is the sum of the MapReduce parts.
17
Performance Analysis
Breakdown of large query running time:
(Chart: minutes, 0-40, per the same Hadoop stages; total job time marked.)
Observation:
› Running time is dominated by RPCs from the client to HDFS to process 1000s of FITS file paths.
Conclusion:
› Need to reduce the number of files.
18
Sequence Files
Sequence files group many small files into a few large files. Just what we need!
Real-time images may not be amenable to logical grouping:
› Therefore, sequence files are filled in an arbitrary manner: a hash function assigns each FITS file, by filename, to a sequence file.
19
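The arbitrary hash-based assignment can be sketched like this (hypothetical filenames; 360 buckets to match the hashed sequence-file DB used in these experiments):

```python
import hashlib

NUM_SEQ_FILES = 360   # size of the hashed sequence-file DB in these experiments

def seq_bucket(fits_name, n=NUM_SEQ_FILES):
    """Assign a FITS file to one of n sequence files by hashing its filename.
    The grouping is arbitrary but deterministic; no sky locality is implied."""
    digest = hashlib.md5(fits_name.encode("utf-8")).hexdigest()
    return int(digest, 16) % n

names = [f"fpC-{run:06d}-r3-{field:04d}.fit" for run in (94, 125) for field in range(3)]
buckets = {name: seq_bucket(name) for name in names}
print(buckets)   # each file lands in some bucket 0..359
```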
Performance Analysis
Comparison:
› FITS input vs. unstructured sequence-file input (360 seq files in the hashed seq DB).
(Chart: minutes, 0-45, for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› 5x speedup!
Hmmm... Can we do better?
20
Performance Analysis
Same comparison, annotated:
› The FITS runs processed only a subset of the database (after prefiltering).
› The sequence-file runs processed the entire database, but still ran faster.
(Chart as on the previous slide.)
Hmmm... Can we do better?
21
Structured Sequence Files
Similar to the way we prefiltered FITS files...
SDSS camera has 30 CCDs:
› 5 colors.
› 6 abutting strips of sky.
› Thus, 30 sequence-file types.
Prefilter by color and sky coverage (single axis):
› Excludes irrelevant sequence files.
› Still have false positives...
› ...catch them in the mappers as before.
22
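One plausible keying for the 30 structured sequence-file types, one per (color, strip) pair; the exact scheme used in the real pipeline isn't specified here:

```python
# 30 structured sequence-file types: one per (color, strip) pair,
# mirroring the SDSS camera's 5 filters x 6 abutting strips of sky.
COLORS = ["u", "g", "r", "i", "z"]
STRIPS = range(1, 7)

def seq_type(color, strip):
    """Map an image's (color, strip) to its structured sequence-file type."""
    return COLORS.index(color) * len(STRIPS) + (strip - 1)

print(seq_type("r", 3))   # -> 14
```

With this layout, a query for one color and a couple of candidate strips touches only a handful of the 30 types, which is exactly what the prefilter exploits.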
Performance Analysis
Comparison:
› FITS vs. unstructured sequence files* vs. structured sequence files†.
(Chart: minutes, 0-45, for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
Conclusion:
› Another 2x speedup for the large query, 1.5x speedup for the small query.
* 360 seq files in the hashed seq DB.  † 1080 seq files in the structured DB.
23
Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
(Chart: minutes, 0-10, per Hadoop stage: deglob input paths, construct file splits, runJob(), mapper done, reducer done, main().)
Prediction:
› Prefiltering should gain performance in the mapper.
Does it?
24
Performance Analysis
Breakdown of large query running time: hashed vs. grouped/prefiltered sequence files.
(Chart as on the previous slide.)
Conclusion:
› Yep, just as expected: the gain appears in the mapper stage.
25
Performance Analysis
How much of this database is Hadoop churning through?
› Experiments were performed on a 100,058-FITS database (1/10th of SDSS).
(Chart: minutes, 0-45, FITS vs. hashed vs. grouped/prefiltered sequence files, for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
26
Performance Analysis
Comparison:
› Number of FITS considered in the mappers vs. number contributing to the coadd (3885 FITS for the 1° sq query).
Annotations on the large-query bars:
› FITS input: 13,415 FITS files, 499 mappers‡.
› Hashed sequence files: 100,058 FITS files, 360 sequence files*, 2077 mappers‡.
› Grouped/prefiltered sequence files: 13,335 FITS files, 144 sequence files†, 341 mappers‡.
Conclusion:
› Mappers must discard many FITS files due to nonoverlap with the query bounds.
* 360 seq files in the hashed seq DB.  † 1080 seq files in the structured DB.  ‡ 800 mapper slots on the cluster.
(Error bars show 95% confidence intervals; outliers removed via Chauvenet.)
27
Using SQL to Find Intersections
Store all image colors and sky bounds in a database:
› First, query color and intersections via SQL.
› Second, send only the relevant images to MapReduce.
Consequence:
› All images processed by the mappers contribute to the coadd.
› No time wasted considering irrelevant images.
Flow: (1) the driver retrieves from the SQL database the FITS filenames that overlap the query bounds; (2) only the relevant images are loaded into MapReduce.
28
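The SQL lookup can be sketched with SQLite. The table layout is hypothetical, and the overlap test is a simple axis-aligned bounding-box intersection (two boxes overlap iff neither lies entirely to one side of the other on either axis), ignoring RA wraparound at 0°/360°:

```python
import sqlite3

# Hypothetical bounds table: one row per FITS image, with its color and
# an axis-aligned sky bounding box (RA/Dec min and max, in degrees).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE images
                (path TEXT, color TEXT, ra0 REAL, ra1 REAL, dec0 REAL, dec1 REAL)""")
conn.executemany("INSERT INTO images VALUES (?, ?, ?, ?, ?, ?)", [
    ("a.fit", "r", 10.0, 10.6, 0.0, 0.4),
    ("b.fit", "r", 12.0, 12.6, 0.0, 0.4),   # misses the query in RA
    ("c.fit", "g", 10.0, 10.6, 0.0, 0.4),   # wrong color
])

color, ra_lo, ra_hi, dec_lo, dec_hi = "r", 10.5, 11.5, 0.1, 1.1
rows = conn.execute("""SELECT path FROM images
                       WHERE color = ? AND ra1 > ? AND ra0 < ?
                         AND dec1 > ? AND dec0 < ?""",
                    (color, ra_lo, ra_hi, dec_lo, dec_hi)).fetchall()
print([r[0] for r in rows])   # -> ['a.fit']
```

The filenames returned by this query are exactly the driver's input list for MapReduce, so every mapped image contributes to the coadd.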
Performance Analysis
Comparison:
› Non-SQL vs. SQL: seq hashed, seq grouped/prefiltered, SQL → seq hashed, SQL → seq grouped.
(Chart: minutes, 0-10, for the 1° sq (3885 FITS) and ¼° sq (465 FITS) queries.)
Conclusion:
› Sigh, no major improvement (SQL is not remarkably superior to non-SQL for the corresponding pairs of bars).
29
Performance Analysis
Per-bar annotations: number of mapper input records (FITS) / number of launched mappers.
1° sq query (3885 contributing FITS):
› Seq hashed: 100,058 / 2077.
› Seq grouped, prefiltered: 13,335 / 341.
› SQL → seq hashed: 3885 / 1714.
› SQL → seq grouped: 3885 / 338.
¼° sq query (465 contributing FITS):
› Seq hashed: 100,058 / 2149.
› Seq grouped, prefiltered: 6674 / 499.
› SQL → seq hashed: 465 / 144.
› SQL → seq grouped: 465 / 128.
Comparable performance here makes sense:
› In essence, prefiltering and SQL performed similar tasks, albeit with 3.5x different mapper inputs (FITS).
Conclusion:
› The cost of discarding many images in the non-SQL case was negligible.
30
Performance Analysis
(Chart and per-bar annotations as on the previous slide.)
Low improvement for SQL in the hashed case is surprising at first...
› ...especially considering the 26x difference in mapper inputs (100,058 vs. 3885 FITS)!
31
Performance Analysis
(Chart and per-bar annotations as before.)
Low improvement for SQL in the hashed case is surprising at first.
Theory:
› The scattered distribution of the relevant FITS across the hashed sequence files prevented efficient mapper reuse.
› Starting each mapper is expensive. This overhead hurt overall performance.
32
Results
(Chart and per-bar annotations as before.)
Just to be clear:
› Prefiltering improved performance due to the reduction of mapper load.
› SQL improved performance due to data locality and more efficient mapper allocation: the required work was unaffected (3885 FITS).
33
Utility of SQL Method
Despite our results (which show SQL to be equivalent to prefiltering)...
...we predict that SQL should outperform prefiltering on larger databases. Why?
› Prefiltering would contend with an increasing number of false positives in the mappers*.
› SQL would incur little additional overhead.
No experiments on this yet.
* A space-filling curve for grouping the data might help.
34
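The footnote's space-filling-curve idea can be sketched with a Z-order (Morton) key, which interleaves the bits of two grid coordinates so that sky neighbors tend to share key prefixes and thus land in the same sequence file. This is illustrative only; no such grouping was tested here.

```python
def morton_key(ix, iy, bits=16):
    """Interleave the bits of two grid coordinates into one Z-order key.
    Nearby (ix, iy) cells tend to share key prefixes, so sorting images
    by this key groups sky neighbors together."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)      # ix bit b -> even position
        key |= ((iy >> b) & 1) << (2 * b + 1)  # iy bit b -> odd position
    return key

# Grid the sky (hypothetically) and sort image cells by their Z-order key.
cells = [(2, 0), (1, 1), (0, 0), (0, 1), (1, 0)]
print(sorted(cells, key=lambda c: morton_key(*c)))
# -> [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
```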
Conclusions
› Packing many small files into a few large files is essential.
› Structured packing and associated prefiltering offer significant gains (reduced mapper load).
› SQL prefiltering of unstructured sequence files yields little improvement (failure to combine scattered HDFS file splits leads to mapper bloat).
› SQL prefiltering of structured sequence files performs comparably to driver prefiltering, but we anticipate superior performance on larger databases.
› On a shared cluster (e.g., the cloud), performance variance is high. This doesn't bode well for online applications, and it also makes precise performance profiling difficult.
35
Future Work
› Parallelize the reducer.
› Less conservative CombineFileSplit builder.
› Conversion to C++, usage of existing C++ libraries.
› Query by time range.
› Increase complexity of projection/interpolation:
  • PSF matching.
› Increase complexity of the stacking algorithm:
  • Convert the straight sum to a weighted sum by image quality.
› Work toward the larger science goals:
  • Study the evolution of galaxies.
  • Look for moving objects (asteroids, comets).
  • Implement fast parallel machine-learning algorithms for detection/classification of anomalies.
36
Questions?
[email protected]
37