Astronomical Image Processing with Hadoop

Keith Wiley, Survey Science Group, Dept. of Astronomy, Univ. of Washington

Keith Wiley, Andrew Connolly, YongChul Kwon, Magdalena Balazinska, Bill Howe, Jeffrey Gardner, Simon Krughoff, Yingyi Bu, Sarah Loebman and Matthew Kraus

Acknowledgments
This work is supported by:
• NSF Cluster Exploratory (CluE) grant IIS-0844580
• NASA grant 08-AISR08-0081

Session Agenda
• Astronomical Survey Science
• Image Coaddition
• Implementing Coaddition within MapReduce
• Optimizing the Coaddition Process
• Conclusions
• Future Work

Astronomical Topics of Study
• Dark energy
• Large-scale structure of the universe
• Gravitational lensing
• Asteroid detection/tracking

What is Astronomical Survey Science?
• Dedicated sky surveys, usually from a single calibrated telescope/camera pair.
• Run for years at a time.
• Gather millions of images and TBs of storage.*
• Require high-throughput data reduction pipelines.
• Require sophisticated off-line data analysis tools.

* Next-generation surveys will gather PBs of image data.

Sky Surveys: Today and Tomorrow
• SDSS* (1999-2005)
  › Founded in part by UW
  › 1/4 of the sky
  › 80TBs total data
• LSST† (2015-2025)
  › 8.4m mirror, 3.2 gigapixel camera
  › Half the sky every three nights
  › 30TB per night... one SDSS every three nights
  › 60PBs total (nonstop ten years)
  › 1000s of exposures of each location

* Sloan Digital Sky Survey
† Large Synoptic Survey Telescope

(Photo caption: "That's a person!")

FITS (Flexible Image Transport System)
• Common astronomical image file format.
• Metadata tags (like EXIF):
  › Most importantly: precise astrometry*
  › Other: geolocation (telescope location), sky conditions, image quality, etc.
• Bottom line: an image format that knows where it is looking.

* Astrometry: position on the sky
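For concreteness, here is a minimal sketch (not from the talk) of reading the astrometric reference point from a FITS header in Java, assuming the open-source nom.tam.fits library and a hypothetical filename:

    import nom.tam.fits.BasicHDU;
    import nom.tam.fits.Fits;
    import nom.tam.fits.Header;

    // Minimal sketch: read the WCS reference coordinates (standard FITS keywords
    // CRVAL1/CRVAL2) so the image "knows where it is looking".
    public class FitsAstrometry {
        public static void main(String[] args) throws Exception {
            Fits fits = new Fits("frame-r-001234.fits");   // hypothetical filename
            BasicHDU hdu = fits.readHDU();                 // primary header/data unit
            Header hdr = hdu.getHeader();
            double ra  = hdr.getDoubleValue("CRVAL1");     // reference RA (degrees)
            double dec = hdr.getDoubleValue("CRVAL2");     // reference Dec (degrees)
            System.out.printf("Image points at RA=%.5f, Dec=%.5f%n", ra, dec);
        }
    }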

Image Coaddition
• Given multiple partially overlapping images and a query (color and sky bounds):
  › Find the images' intersections with the query bounds.
  › Project the bitmaps to the query bounds.
  › Stack and mosaic them into a final product.

Image Coaddition (cost of the steps)
• Same steps as above, annotated by cost on the slide: the per-image work of finding intersections and projecting the bitmaps is expensive; stacking and mosaicking the projected results is cheap.

Image Stacking (Signal Averaging)
• Stacking improves SNR, making fainter objects visible.
• Example (SDSS, Stripe 82):
  › Top: single image, R-band.
  › Bottom: 79-deep stack (~9x SNR improvement), with numerous additional detections.
• Variable conditions (e.g., atmosphere, PSF, haze) mean the stacking algorithm can be more complex than a simple sum.
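For intuition, a minimal illustrative sketch (not the project's pipeline): an idealized per-pixel mean over N aligned exposures. For independent noise the SNR improves by roughly the square root of N, so a 79-deep stack gives about an 8.9x improvement, matching the ~9x quoted above.

    // Pixel-wise mean of N aligned exposures of identical dimensions.
    // images[k][y][x] holds exposure k; the straight (unweighted) mean is the
    // simplest stacking algorithm -- real pipelines weight by image quality.
    public class SimpleStack {
        public static double[][] coadd(double[][][] images) {
            int n = images.length, h = images[0].length, w = images[0][0].length;
            double[][] out = new double[h][w];
            for (double[][] img : images)
                for (int y = 0; y < h; y++)
                    for (int x = 0; x < w; x++)
                        out[y][x] += img[y][x] / n;
            return out;
        }
    }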

Existing Image Coaddition Systems
• SWarp
  › Multi-threaded parallelism (single machine only).
• SDSS coadds of Stripe 82 (Fermilab)
  › Same dataset used in our work.
  › One-off project, not a general-purpose tool.
• Montage (run on TeraGrid)
  › Most similar to our work.
  › MPI (complicated), TeraGrid (dedicated, expensive).
• MapReduce (our work, this talk)
  › High-level, potentially simpler to program.
  › Scalable on cheap commodity hardware (the cloud).

Advantages of MapReduce (Hadoop)
• High-level problem description: no effort spent on inter-node communication, message passing, etc.
• Programmed in Java (accessible to most science researchers, not just computer scientists and engineers).
• Runs on cheap commodity hardware, potentially in the cloud (e.g., Amazon EC2).
• Scalable: 1000s of nodes can be added to the cluster with no modification to the researcher's software.
• Large community of users/support.

Hadoop
• A massively parallel data-processing system:
  › In one sense: a parallel computing system (a cluster).
  › In another sense: a parallel database.
  › It's both!

(Diagram: data is stored on HDFS, the Hadoop Distributed File System, across 100s of computers in a cluster. Numerous parallel Mappers (programs on the HDFS computers) process the data, numerous parallel Reducers process it further, and the final output lands back on HDFS once the job is done and the data is fully processed.)

Coaddition in Hadoop

(Diagram: each input FITS image goes to a Mapper, which detects the image's intersection with the query bounds and projects the bitmap to the query's coordinate system; this stage is parallel by image. The projected intersections flow to a Reducer, which stacks and mosaics them into the final coadd, parallel by query, and writes the result to HDFS.)
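A minimal skeleton of how such a job could be structured in the Hadoop Java API is sketched below. This is not the project's actual code: FitsWritable, QueryBounds, and CoaddAccumulator are hypothetical helper types standing in for FITS deserialization, query geometry, and the running stack.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CoaddJob {
        // Parallel by image: one map call per FITS record.
        public static class CoaddMapper extends Mapper<Text, FitsWritable, Text, FitsWritable> {
            @Override
            protected void map(Text filename, FitsWritable image, Context ctx)
                    throws IOException, InterruptedException {
                QueryBounds query = QueryBounds.fromConfig(ctx.getConfiguration());
                if (!image.intersects(query)) return;            // discard non-overlapping images
                FitsWritable projected = image.projectTo(query); // reproject the overlap region
                ctx.write(new Text("coadd"), projected);         // single key: one coadd per query
            }
        }

        // Parallel by query: all projected intersections for a query meet here.
        public static class CoaddReducer extends Reducer<Text, FitsWritable, Text, FitsWritable> {
            @Override
            protected void reduce(Text key, Iterable<FitsWritable> projections, Context ctx)
                    throws IOException, InterruptedException {
                CoaddAccumulator acc = new CoaddAccumulator();
                for (FitsWritable p : projections) acc.stack(p); // per-pixel sum / mosaic
                ctx.write(key, acc.toImage());                   // final coadd written to HDFS
            }
        }
    }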

Driver Prefiltering
• To assist the process, we prefilter the FITS files in the driver.
• The SDSS camera has 30 CCDs:
  › 5 colors
  › 6 abutting strips of sky
• Prefilter (via a path glob) by color and sky coverage (single axis), as sketched below:
  › Excludes many irrelevant FITS files.
  › The sky-coverage filter is only single-axis, so false positives slip through... to be discarded in the mappers.

(Figure legend: query bounds; relevant FITS; prefilter-excluded FITS; false positives.)
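An illustrative driver-side prefilter follows. The directory layout and glob pattern are hypothetical, not SDSS's real naming scheme; the point is that only FITS paths matching the requested color band and nearby sky strips are handed to the job at all.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class PrefilterDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "coadd");
            // Glob on color band and sky strip; everything else is never read.
            FileInputFormat.setInputPaths(job,
                new Path("/sdss/fits/band-r/strip-{3,4}/*.fits"));
            // ... set mapper/reducer/output classes, then job.waitForCompletion(true);
        }
    }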

Performance Analysis
• Running time:
  › 2 query sizes.
  › Run against 1/10th of SDSS (100,058 FITS).
• Conclusion:
  › Considering the small dataset, this is too slow!

(Chart: running time in minutes, 0-45, for query sky-bounds extents of 1° sq (3885 FITS) and ¼° sq (465 FITS). Error bars show 95% confidence intervals; outliers removed via Chauvenet's criterion.)

Remember 42 minutes for the next slide.

Performance Analysis
• Breakdown of the large-query running time.
• main() is the sum of: Driver + runJob().
• runJob() is the sum of the MapReduce parts.

(Chart: minutes, 0-40, by Hadoop stage: Deglob Input Paths, Construct File Splits, Mapper Done, Reducer Done, runJob(), main(); each stage is attributed to Driver or MapReduce time.)

Performance Analysis
• Breakdown of the large-query running time.
• Observation:
  › Running time is dominated by RPCs from the client to HDFS to process 1000s of FITS file paths.
• Conclusion:
  › Need to reduce the number of files.

(Chart: minutes, 0-40, by Hadoop stage as before; the RPC-bound stages account for most of the total job time.)

Sequence Files
• Sequence files group many small files into a few large files.
• Just what we need!
• Real-time images may not be amenable to logical grouping.
  › Therefore, sequence files are filled in an arbitrary manner: each FITS file is assigned to a sequence file by a hash function on its filename (see the sketch below).

(Diagram: FITS filename, hash function, target sequence file.)
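A sketch of what that packing step could look like (the HDFS paths are hypothetical; the 360-way split matches the hashed sequence DB described later): each small FITS file is appended to one of N large sequence files chosen by hashing its filename, with key = filename and value = raw bytes.

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackFits {
        public static void main(String[] args) throws Exception {
            int numSeqFiles = 360;                       // hashed sequence DB size
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer[] writers = new SequenceFile.Writer[numSeqFiles];
            for (int i = 0; i < numSeqFiles; i++)
                writers[i] = SequenceFile.createWriter(fs, conf,
                    new Path("/sdss/seq/hashed-" + i + ".seq"), Text.class, BytesWritable.class);

            for (String fitsPath : args) {               // local FITS files to pack
                byte[] bytes = Files.readAllBytes(Paths.get(fitsPath));
                int bucket = (fitsPath.hashCode() & Integer.MAX_VALUE) % numSeqFiles;
                writers[bucket].append(new Text(fitsPath), new BytesWritable(bytes));
            }
            for (SequenceFile.Writer w : writers) w.close();
        }
    }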

Performance Analysis
• Comparison:
  › FITS input vs. unstructured sequence-file input.*
• Conclusion:
  › 5x speedup!
• Hmmm... Can we do better?

(Chart: minutes, 0-45, FITS vs. Seq hashed, for 1° sq (3885 FITS) and ¼° sq (465 FITS) queries. Error bars show 95% confidence intervals; outliers removed via Chauvenet's criterion.)

* 360 sequence files in the hashed sequence DB.

Performance Analysis
• Comparison:
  › FITS input vs. unstructured sequence-file input.*
  › Note: the FITS runs processed only a prefiltered subset of the database; the sequence-file runs processed the entire database, but still ran faster.
• Conclusion:
  › 5x speedup!
• Hmmm... Can we do better?

(Chart: same comparison as the previous slide; 95% confidence intervals, Chauvenet outliers removed.)

* 360 sequence files in the hashed sequence DB.

Structured Sequence Files
• Similar to the way we prefiltered FITS files...
• The SDSS camera has 30 CCDs:
  › 5 colors
  › 6 abutting strips of sky
  › Thus, 30 sequence-file types (see the naming sketch below).
• Prefilter by color and sky coverage (single axis):
  › Excludes irrelevant sequence files.
  › Still have false positives.
  › Catch them in the mappers as before.

(Figure legend: query bounds; relevant FITS; prefilter-excluded FITS; false positives.)
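The naming scheme below is purely illustrative (not SDSS's real layout): with 5 color bands and 6 camera strips, each FITS file lands in one of 30 structured sequence-file types, so the driver can exclude whole sequence files by name before the job starts.

    // Illustrative bucketing of a FITS file into one of 30 structured
    // sequence-file types, keyed by color band and camera strip.
    public class StructuredBucket {
        static final String[] BANDS = {"u", "g", "r", "i", "z"};

        static String seqFileFor(String band, int strip) {
            return String.format("/sdss/seq/structured-%s-strip%d.seq", band, strip);
        }

        public static void main(String[] args) {
            System.out.println(seqFileFor("r", 3));  // /sdss/seq/structured-r-strip3.seq
        }
    }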

Performance Analysis
• Comparison:
  › FITS vs. unstructured sequence files* vs. structured sequence files.†
• Conclusion:
  › Another 2x speedup for the large query, 1.5x speedup for the small query.

(Chart: minutes, 0-45, FITS vs. Seq hashed vs. Seq grouped/prefiltered, for 1° sq (3885 FITS) and ¼° sq (465 FITS) queries; 95% confidence intervals, Chauvenet outliers removed.)

* 360 sequence files in the hashed sequence DB.
† 1080 sequence files in the structured DB.

Performance Analysis
• Breakdown of the large-query running time: Seq hashed vs. Seq grouped/prefiltered.
• Prediction:
  › Prefiltering should gain performance in the mapper stage.
• Does it?

(Chart: minutes, 0-10, by Hadoop stage: Deglob Input Paths, Construct File Splits, Mapper Done, Reducer Done, runJob(), main().)

Performance Analysis
• Breakdown of the large-query running time: Seq hashed vs. Seq grouped/prefiltered.
• Prediction:
  › Prefiltering should gain performance in the mapper stage.
• Conclusion:
  › Yep, just as expected.

(Chart: same stages as the previous slide.)

Performance Analysis
• Experiments were performed on a 100,058-FITS database (1/10th of SDSS).
• How much of this database is Hadoop churning through?

(Chart: minutes, 0-45, FITS vs. Seq hashed vs. Seq grouped/prefiltered, for 1° sq (3885 FITS) and ¼° sq (465 FITS) queries; 95% confidence intervals, Chauvenet outliers removed.)

Performance Analysis
• Comparison:
  › Number of FITS considered in the mappers vs. number actually contributing to the coadd.
• Bar annotations for the 1° sq query (only 3885 FITS contribute to the coadd):
  › FITS (driver-prefiltered): 13,415 FITS files, 499 mappers.‡
  › Seq hashed: 100,058 FITS files, 360 sequence files,* 2077 mappers.‡
  › Seq grouped, prefiltered: 13,335 FITS files, 144 sequence files,† 341 mappers.‡
• Conclusion:
  › Mappers must discard many FITS files due to non-overlap with the query bounds.

(Chart: minutes, 0-45, for 1° sq (3885 FITS) and ¼° sq (465 FITS) queries; 95% confidence intervals, Chauvenet outliers removed.)

* 360 sequence files in the hashed sequence DB.
† 1080 sequence files in the structured DB.
‡ 800 mapper slots on the cluster.

Using SQL to Find Intersections
• Store all image colors and sky bounds in a database:
  › First, query for color and intersections via SQL.
  › Second, send only the relevant images to MapReduce (sketched below).
• Consequence: all images processed by the mappers contribute to the coadd; no time is wasted considering irrelevant images.

(Diagram: the driver (1) retrieves from the SQL database the FITS filenames that overlap the query bounds, then (2) loads only the relevant images into MapReduce.)
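A sketch of the driver-side lookup follows. The table name, column names, and JDBC URL are assumptions for illustration, not the project's actual schema: the driver asks the database for filenames of images in the requested band whose sky bounds overlap the query rectangle, and only those files are fed to MapReduce.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class SqlPrefilter {
        public static List<String> overlappingImages(String band, double raMin,
                double raMax, double decMin, double decMax) throws Exception {
            // Two intervals overlap iff each one's max is >= the other's min.
            // (RA wraparound at 0/360 degrees is ignored in this sketch.)
            String sql = "SELECT filename FROM images WHERE band = ? "
                       + "AND ra_max >= ? AND ra_min <= ? "
                       + "AND dec_max >= ? AND dec_min <= ?";
            List<String> files = new ArrayList<String>();
            try (Connection c = DriverManager.getConnection("jdbc:postgresql://dbhost/sdss");
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setString(1, band);
                ps.setDouble(2, raMin);  ps.setDouble(3, raMax);
                ps.setDouble(4, decMin); ps.setDouble(5, decMax);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) files.add(rs.getString(1));
                }
            }
            return files;   // hand these paths to the MapReduce job's input
        }
    }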

Performance Analysis
• Comparison:
  › non-SQL vs. SQL.
• Conclusion:
  › Sigh, no major improvement (SQL is not remarkably superior to non-SQL for the corresponding pairs of bars).

(Chart: minutes, 0-10, for Seq hashed, Seq grouped/prefiltered, SQL→Seq hashed, and SQL→Seq grouped, at 1° sq (3885 FITS) and ¼° sq (465 FITS) queries.)

Performance Analysis
• Comparable performance here makes sense:
  › In essence, prefiltering and SQL performed similar tasks, albeit with 3.5x different mapper inputs (13,335 vs. 3885 FITS for the large query).
• Conclusion:
  › The cost of discarding many images in the non-SQL case was negligible.

(Chart annotations, 1° sq query: mapper input records of 100,058, 13,335, 3885, and 3885 FITS, with 2077, 341, 1714, and 338 launched mappers, for Seq hashed, Seq grouped/prefiltered, SQL→Seq hashed, and SQL→Seq grouped respectively. ¼° sq query: input records of 100,058, 6674, 465, and 465 FITS; launched-mapper counts of 2149, 499, 144, and 128.)

Performance Analysis
• Low improvement for SQL in the hashed case is surprising at first...
  › ...especially considering the 26x difference in mapper inputs (100,058 vs. 3885 FITS for the large query)!

(Chart: same four configurations and annotations as the previous slide.)

Performance Analysis
• Low improvement for SQL in the hashed case is surprising at first.
• Theory:
  › The scattered distribution of the relevant FITS across the hashed sequence files prevented efficient mapper reuse.
  › Starting each mapper is expensive, and this overhead hurt overall performance (1714 mappers launched for only 3885 relevant FITS in the large query).

(Chart: same four configurations and annotations as before.)
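This is not from the talk, but two standard classic-Hadoop knobs address exactly this kind of per-mapper startup overhead, and the Future Work slide's "less conservative CombineFileSplit builder" points in the same direction: reuse task JVMs and let a CombineFileInputFormat pack many small files or scattered records into each split. A sketch, assuming the Hadoop 1.x-era configuration keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapperOverheadTuning {
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Reuse each task JVM for an unlimited number of map tasks (-1 = no limit),
            // so startup cost is paid once per JVM rather than once per task.
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
            // With a CombineFileInputFormat-based input, allow large combined splits
            // so many small inputs share one mapper (1.x-era key name).
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);
            return Job.getInstance(conf, "coadd");
        }
    }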

Results
• Just to be clear:
  › Prefiltering improved performance by reducing mapper load.
  › SQL improved performance through data locality and more efficient mapper allocation; the required work was unaffected (3885 FITS either way).

(Chart: minutes, 0-10, for the four configurations, with the same mapper-input and launched-mapper annotations as before.)

Utility of SQL Method
• Despite our results (which show SQL to be equivalent to prefiltering)...
• ...we predict that SQL should outperform prefiltering on larger databases.
• Why?
  › Prefiltering would contend with an increasing number of false positives in the mappers.*
  › SQL would incur little additional overhead.
• No experiments on this yet.

* A space-filling curve for grouping the data might help.

Conclusions
• Packing many small files into a few large files is essential.
• Structured packing and the associated prefiltering offer significant gains (reduced mapper load).
• SQL prefiltering of unstructured sequence files yields little improvement (failure to combine scattered HDFS file splits leads to mapper bloat).
• SQL prefiltering of structured sequence files performs comparably to driver prefiltering, but we anticipate superior performance on larger databases.
• On a shared cluster (e.g., the cloud), performance variance is high, which doesn't bode well for online applications and also makes precise performance profiling difficult.

Future Work
• Parallelize the reducer.
• A less conservative CombineFileSplit builder.
• Conversion to C++ and use of existing C++ libraries.
• Query by time range.
• Increase the complexity of projection/interpolation:
  › PSF matching.
• Increase the complexity of the stacking algorithm:
  › Convert the straight sum to a sum weighted by image quality.
• Work toward the larger science goals:
  › Study the evolution of galaxies.
  › Look for moving objects (asteroids, comets).
  › Implement fast parallel machine-learning algorithms for detection/classification of anomalies.

Questions? [email protected]
