ASTRONOMICAL DATA ANALYSIS SOFTWARE AND SYSTEMS XIV
ASP Conference Series, Vol. 347, 2005
P. L. Shopbell, M. C. Britton, and R. Ebert, eds.
WebCom-G: Implementing an Astronomical Data Analysis Pipeline on a Grid-type Infrastructure

Seathrún Ó Tuairisg, Michael Browne, John Cuniffe, and Andrew Shearer
Computational Astrophysics Laboratory, Information Technology Department, National University of Ireland, Galway, Ireland

John Morrison and Keith Power
Centre for Unified Computing, Department of Computer Science, University College Cork, Ireland

Abstract. The recent upsurge in astrophysical research applications of grid technologies, coupled with the increase in temporal and spatial sky coverage by dedicated all-sky surveys and on-line data archives, has afforded us the opportunity to develop an automated image reduction and analysis pipeline. Written using Python and Pyraf, the Python implementation of the IRAF package, it has been tailored to act on data from a number of different astronomical instruments. By exploiting inherent parallelisms within the pipeline, we have augmented this project with the ability to run over a network of computers. Of particular interest to us is an investigation of the latency penalties incurred in running the pipeline within a cluster and between two clusters. We have used a condensed-graph programming model, the Grid middleware solution WebCom-G, to realise the Grid implementation. We describe how such a reorganisation of an astronomical image analysis structure can improve operational efficiency, and show how the paradigm can be extended to other applications of image processing. We intend to use this project as a test bed for eventually running our image processing applications over a grid network of computers, with a view toward possible implementation as part of a virtual observatory infrastructure.
1. Introduction
Recent advances in astronomy have revolutionised the manner and depth at which we image the night sky, increasing the information potential which can be gleaned from scientific exposures. Systematic sky-survey archives are now stored at multiple sites and accessed across the globe via high-speed networks. Although modern astronomical research is increasingly geared towards the efficient mining and processing of these massive, high-density, distributed data banks, most astronomical applications have not evolved to deal effectively with this new computational paradigm. Rather than re-code existing structures to enable them to deal more efficiently with the current rigours of astronomical data processing, we propose to take advantage of inherent parallelisms in our data process flows to allow execution across a network of computing resources, using the recently developed middleware application WebCom-G (Morrison et al. 2001). Our scientific example is a comprehensive survey of supernova remnants in M31.
Figure 1. A schematic of the data processing flow, showing how WebCom acts as the link between the data archive and the processing nodes, which are executed across distributed computing resources.
2. The Scientific Application
The search for supernova remnants involves mining the data for Hα-bright and morphologically distinct objects (Magnier et al. 1995). A brief schematic of the main parts of the data processing flow, including object detection and subtraction, is outlined in Figure 1. The pipeline was written in Python and utilises Pyraf (the Python implementation of IRAF) tasks. Increased complexity was added by optimising the point-spread-function (PSF) modelling stage: the fainter stars surrounding the stars used for creating the PSF are subtracted, yielding a better PSF model. The execution time per image on a single processor is typically several hours, although this can vary significantly depending on stellar density and the sensitivity limits applied to the object detection. The M31 images derive from the Isaac Newton Group Wide Field Survey archive (Lewis, Bunclark, & Walton 1999) and are 4-chip (2048×4100 pixels per chip) frames, covering 0.29 square degrees. Each 4-chip image is 67MB (integer type), giving a total of ∼25GB of raw data for the ∼400 images being analysed. These are pre-reduced images, simplifying the processing load. Processing incurs a two-fold increase in data volume, comprising derivative images and photometry files.
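To make the PSF-refinement step concrete, the following is a minimal sketch of how one such iteration might look in Python/Pyraf. The task names are standard IRAF DAOPHOT tasks; the file names, parameter values, and single-pass structure are illustrative assumptions, not the pipeline's actual code.

```python
# Illustrative sketch of the PSF-refinement step using Pyraf/DAOPHOT.
# Task and parameter names follow the standard IRAF DAOPHOT package;
# file names and the single-iteration structure are assumptions.
from pyraf import iraf

# Load the required IRAF packages.
iraf.noao(_doprint=0)
iraf.digiphot(_doprint=0)
iraf.daophot(_doprint=0)

def refine_psf(image):
    """Build a PSF model, then rebuild it after subtracting the
    fainter neighbours of the PSF stars (as described above)."""
    # Detect objects and measure aperture photometry.
    iraf.daofind(image=image, output=image + ".coo", verify="no")
    iraf.phot(image=image, coords=image + ".coo",
              output=image + ".mag", verify="no")

    # Select PSF stars and fit a first-pass PSF model.
    iraf.pstselect(image=image, photfile=image + ".mag",
                   pstfile=image + ".pst", maxnpsf=25, verify="no")
    iraf.psf(image=image, photfile=image + ".mag",
             pstfile=image + ".pst", psfimage=image + ".psf0",
             opstfile=image + ".pst1", groupfile=image + ".psg",
             interactive="no", verify="no")

    # Fit all stars with the first-pass model, then subtract the
    # neighbours while keeping the PSF stars themselves (the exclude
    # file lists the stars NOT to subtract).
    iraf.allstar(image=image, photfile=image + ".mag",
                 psfimage=image + ".psf0", allstarfile=image + ".als",
                 rejfile=image + ".arj", subimage=image + ".sub",
                 verify="no")
    iraf.substar(image=image, photfile=image + ".als",
                 exfile=image + ".pst", psfimage=image + ".psf0",
                 subimage=image + ".clean", verify="no")

    # Re-derive the PSF from the neighbour-subtracted frame.
    iraf.psf(image=image + ".clean", photfile=image + ".mag",
             pstfile=image + ".pst", psfimage=image + ".psf1",
             opstfile=image + ".pst2", groupfile=image + ".psg1",
             interactive="no", verify="no")
```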
3. WebCom-G
The WebCom-G metacomputer is being developed with the purpose of providing users with a lightweight interface to the Grid for their applications. It uses condensed graphs (Morrison 1996) to provide an integrated solution from applications down to hardware. A graph node fires when all of its dependencies are accounted for and a free resource becomes available. Our application graph (Figure 1) is a series of nodes, one per image in our archive. More complicated structures in our data pipeline can also be graphed to increase parallelism. WebCom-G is compatible with existing Grid middleware tools such as Globus, Condor, and Cactus, and security features are currently being implemented.
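The firing rule can be illustrated in a few lines of Python. This is not WebCom-G's actual API; it is a hand-rolled sketch of the condensed-graph idea, in which a node becomes eligible once all of its inputs are resolved and runs as soon as a worker is free.

```python
# Hand-rolled sketch of the condensed-graph firing rule: a node
# "fires" once all of its dependencies have produced results and a
# free worker is available. Illustrative only; it does not reflect
# WebCom-G's actual implementation. Assumes an acyclic graph.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_graph(nodes, deps, task, max_workers=4):
    """nodes: iterable of node ids; deps: dict mapping a node to the
    set of nodes it depends on; task: callable invoked as task(node).
    The thread pool stands in for the pool of free resources."""
    pending = {n: set(deps.get(n, ())) for n in nodes}
    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or futures:
            # Fire every node whose dependencies are all satisfied.
            ready = [n for n, d in pending.items() if d <= done]
            for n in ready:
                del pending[n]
                futures[pool.submit(task, n)] = n
            # Wait for at least one running node to complete.
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for f in finished:
                done.add(futures.pop(f))

# Example: four independent image nodes run in parallel; a "merge"
# node fires only after all four have completed.
deps = {"merge": {"img1", "img2", "img3", "img4"}}
run_graph(["img1", "img2", "img3", "img4", "merge"], deps,
          task=lambda n: print("processing", n))
```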
Figure 2. A schematic of the application data flow as executed by WebCom-G. This node is executed on a single machine and can act on one or more images.
4. Hardware and Software
We allowed a 32-node PC cluster (dual 2.4GHz Pentium 4 processors and 2GB of memory per node, interconnected via Gigabit Ethernet and running Red Hat Linux) to act as a WebCom client. A shared file system minimised data transfer times between nodes. WebCom-G assigns nodes to machines on an application-availability basis by interrogating a database. The astronomical pipeline was written using Python/Pyraf to encapsulate the procedures from the original IRAF pipeline (Ó Tuairisg et al. 2004); thus software availability is the only limitation on accessible computing resources. WebCom embeds each node in the graphed application with its own scripts, which handle scheduling and resource allocation.
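As an illustration of what such a per-node script might look like, the sketch below first checks that the required software stack (Pyraf) is present on the host, since software availability is the only limitation on accessible resources, and then processes its assigned images. The script and file names are hypothetical, not WebCom-G's embedded scripts.

```python
# Hypothetical per-node wrapper in the spirit of the scripts WebCom
# embeds with each graph node: verify software availability, then
# act on one or more assigned images (cf. Figure 2). Names are
# illustrative; "reduce_image.py" stands in for the actual pipeline.
import importlib.util
import subprocess
import sys

def software_available():
    """A node is only runnable on hosts with the pipeline stack."""
    return importlib.util.find_spec("pyraf") is not None

def run_node(images):
    if not software_available():
        sys.exit("pyraf not installed: host cannot execute this node")
    for image in images:
        # Each image is processed by the Python/Pyraf pipeline.
        subprocess.run([sys.executable, "reduce_image.py", image],
                       check=True)

if __name__ == "__main__":
    run_node(sys.argv[1:])
```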
5. The Method in Practice
We found this approach to be a simple, flexible, and robust way of obtaining significant performance increases in our data reduction and analysis processes. The approach is flexible because it allows us to execute each node (Figure 2) on one or more images. Increased parallelism can be obtained with minimal recoding by further resolving our graph into separate nodes to be executed simultaneously across different machines. We ran our application across a shared file system,
thus bypassing the network-bandwidth bottleneck (although this is likely to be a factor when using more widely distributed computing resources). It should be noted that because each image samples a different region of the galaxy, execution time varies from node to node; increased data-splitting (within each image) would therefore yield better performance.
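To give a flavour of such within-image data-splitting, the sketch below tiles a single chip into overlapping sub-frames, each of which could become its own graph node so that per-node execution times even out. It uses astropy.io.fits purely for illustration (the pipeline itself predates astropy), and the tile geometry, overlap, and file names are arbitrary assumptions.

```python
# Illustrative within-image data-splitting: cut one 2048x4100-pixel
# chip into tiles, each of which could be handed to a separate graph
# node. astropy.io.fits is used for illustration only; the tiling
# geometry and naming scheme are assumptions.
import os
from astropy.io import fits

def split_chip(filename, nx=2, ny=4, overlap=50):
    """Write nx*ny overlapping tiles of the primary HDU to disk.
    The overlap keeps objects near a tile edge measurable in at
    least one tile."""
    with fits.open(filename) as hdul:
        data, header = hdul[0].data, hdul[0].header
    height, width = data.shape
    base, _ = os.path.splitext(filename)
    for j in range(ny):
        for i in range(nx):
            x0 = max(i * width // nx - overlap, 0)
            x1 = min((i + 1) * width // nx + overlap, width)
            y0 = max(j * height // ny - overlap, 0)
            y1 = min((j + 1) * height // ny + overlap, height)
            tile = fits.PrimaryHDU(data[y0:y1, x0:x1], header=header)
            tile.writeto(f"{base}_tile{i}{j}.fits", overwrite=True)

# Each tile file can then be assigned to its own graph node.
split_chip("m31_field_chip1.fits")
```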
6. Conclusions and Future Work
We have described a novel method of running astronomical data processing applications, with minimal re-coding, across a network of distributed computing resources using WebCom-G. We benefit by obtaining significant performance increases, yet do so by re-using existing applications. In our example we act on a split data archive, but by expressing operations in WebCom-G's graphical format we can exploit increasingly fine-grained parallelisms within our processes. This approach is also well suited to current Grid technologies, which exploit distributed, low-cost computing resources. The Virtual Observatory concept, which has recently been gaining currency (Pierfederici et al. 2001; Quinn et al. 2002; Szalay 2001), could benefit from this approach of increasing performance through re-using existing astronomical applications. We also intend to extend this technique to other scientific applications, including medical imaging.

Acknowledgments. We acknowledge the support of Science Foundation Ireland under the WebCom-G programme.

References

Lewis, J. R., Bunclark, P. S., & Walton, N. A. 1999, in ASP Conf. Ser., Vol. 172, ADASS VIII, ed. D. M. Mehringer, R. L. Plante, & D. A. Roberts (San Francisco: ASP), 179

Magnier, E. A., et al. 1995, A&AS, 114, 215

Morrison, J. 1996, PhD Thesis, Technische Universiteit Eindhoven

Morrison, J., Power, D., & Kennedy, J. 2001, Journal of Supercomputing, 18, 47

Ó Tuairisg, S., Butler, R., Golden, A., Shearer, A., Voisin, B., & Micol, A. 2004, in ASP Conf. Ser., Vol. 314, ADASS XIII, ed. F. Ochsenbein, M. Allen, & D. Egret (San Francisco: ASP), 444

Pierfederici, F., Benvenuti, P., Micol, A., Pirenne, B., & Wicenec, A. 2001, in ASP Conf. Ser., Vol. 238, ADASS X, ed. F. R. Harnden, Jr., F. A. Primini, & H. E. Payne (San Francisco: ASP), 141

Quinn, P. J., Benvenuti, P., Diamond, P. J., Genova, F., Lawrence, A., & Mellier, Y. 2002, in Virtual Observatories, Proc. SPIE, 4846, 1

Szalay, A. S. 2001, in ASP Conf. Ser., Vol. 238, ADASS X, ed. F. R. Harnden, Jr., F. A. Primini, & H. E. Payne (San Francisco: ASP), 3