HONOURS PROJECT REPORT

GPU Accelerated Source Extraction in Radio Astronomy: A CUDA Implementation

Gary Resnick
[email protected]

Supervised by:
Michelle Kuttel ([email protected])
Patrick Marais ([email protected])

Category                                                       Min   Max   Chosen
1. Software Engineering/System Analysis                         0    15     15
2. Theoretical Analysis                                         0    25      0
3. Experiment Design and Execution                              0    20      3
4. System Development and Implementation                        0    15     15
5. Results, Findings and Conclusion                            10    20     15
6. Aim Formulation and Background Work                         10    15     12
7. Quality of Report Writing and Presentation                  10    10     10
8. Adherence to Project Proposal and Quality of Deliverables   10    10     10
9. Overall General Project Evaluation                           0    10      0
Total Marks                                                           80     80

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF CAPE TOWN

2010

Abstract

Next-generation radio telescopes are expected to produce petabytes of astronomical image data of the universe. To complicate matters, the raw astronomical data must be processed in various ways before it is usable by astronomers. Source extraction is the final stage in the astronomical image processing pipeline, where properties and characteristics of astronomical points of interest, or sources, are calculated and reported. A number of software packages automate the complex process of radio source identification and extraction. Still, this process is time consuming, and the identification and extraction of sources from the data in a reasonable timeframe is recognised as a non-trivial algorithmic task. This paper presents two algorithmic optimisations of the approach utilized in an existing package, Duchamp, aimed at reducing the execution time of the source extraction routine. Two separate implementations were successfully deployed on the CPU and the Graphics Processing Unit (GPU) respectively. NVIDIA's Compute Unified Device Architecture (CUDA) was utilized to delegate computationally expensive routines to the GPU. Our CPU implementation employs a memory management scheme to segment and process each image by channel, and the routine to merge non-adjacent sources was redesigned and optimised. This approach resulted in a robust and scalable source extraction solution. However, testing raised a number of issues with both implementations. Firstly, the data structure we developed to house extracted sources proved to be sub-optimal in both solutions. Further, our GPU implementation is constrained by input size, as it has no memory management scheme, and it excludes a merging operation. Finally, time constraints meant that a hybrid CPU-GPU solution was not attempted. Nevertheless, both the CPU and GPU implementations show good speedups over the original Duchamp code, and the project can be considered a success. With more time, and the experience gained from these initial prototypes, both implementations could be optimised further.


ACKNOWLEDGEMENTS

The author would like to express sincere gratitude to Dr. Kurt van Heyden and Dr. Sarah Blyth for their invaluable contributions to this report. This report was made possible by the generosity, motivation, guidance, resourcefulness and support amiably demonstrated by Dr. Michelle Kuttel and Dr. Patrick Marais throughout the course of this research.


Table of Contents

Abstract
1 Introduction
2 Background
  2.1 Radio Interference
  2.2 Source Detection Software
  2.3 Astronomical Data FITS Format
  2.4 Graphics Processing Unit Architecture and Terminology
  2.5 General Purpose Computing on Graphics Processing Units
3 Design
  3.1 Design Objectives
  3.2 System Overview
  3.3 Source Detection
4 Implementation
  4.1 Development Details
  4.2 Two Phase Implementation
    4.2.1 Phase 1: CPU Implementation
    4.2.2 Phase 2: GPU Implementation
  4.3 Limitations
5 Results
  5.1 Experiment Design
  5.2 Datasets
  5.3 Speedup Results
  5.4 Summary
6 Conclusion
  6.1 Future Work


List of Figures

Figure 1: CUDA processing flow
Figure 2: Overview of the components within the system
Figure 3: 2-Phase design strategy
Figure 4: Source detection pipeline
Figure 5: Source data structure
Figure 6: CPU connected component labelling scan mask
Figure 7: CPU label resolution data structure
Figure 8: GPU connected component kernel algorithm
Figure 9: Relationship between execution time and input size
Figure 10: Relationship between execution time and number of extracted sources
Figure 11: Breakdown of CPU implementation's execution time by function
Figure 12: Breakdown of GPU implementation's execution time by kernel

List of Tables

Table 1: Existing radio source detection modules
Table 2: Dataset used during testing
Table 3: Summation of reported results


Chapter 1

Introduction

Next-generation telescopes, such as the proposed Square Kilometre Array (SKA) and MeerKAT, will produce petabytes of data. The generation of this much data makes manual inspection infeasible and, as such, automated identification and classification of points of interest is an essential requirement. Algorithms that address these needs must be adapted to exploit the underlying hardware upon which they operate in order to execute in an acceptable timeframe. Graphics Processing Units (GPUs) have demonstrated their competence in solving general purpose, computationally expensive problems that map effectively onto a parallelizable solution. Source extraction, effectively a family of image processing routines, decomposes naturally into the parallel programming paradigm. NVIDIA's CUDA exposes low level access to the GPU's architecture through the use of high level programming constructs.

Source extraction is the final stage of source identification, where properties and characteristics of detected objects, such as their size, velocity and spatial orientation, are determined. Astronomers utilize these tools to isolate and inspect points of interest from a reduced set, enabling a quick and efficient investigation into specific sources. Acceleration of the source extraction method is achieved by deploying optimised algorithms, tailored for source extraction, to both the CPU and the GPU. This report concentrates on the approach, design and efficiency of these algorithms.

Research Question

This report investigates the acceleration of source extraction routines through their deployment to the GPU. A speedup is considered significant and successful if:

- it provides reliable results (comparable to other source extraction software, see Section 2.2);
- it improves upon the execution time of Duchamp [30] by a factor of 5 or greater;
- it offers a robust and scalable solution able to handle large datasets.


Chapter 2

Background

Astronomy is the scientific study of celestial objects that arises from the innate human desire to explore the universe. Radio Astronomy, a sub-field of Astronomy, analyses electromagnetic emissions (radiation at radio wavelengths) from astronomical sources to further the understanding of their nature and origin. Radio telescopes, as used in Radio Astronomy, are directional radio antennas that image and detect radio source emissions within the 10⁹ nm to 10¹² nm (1 m to 1 km) wavelength range [1]. These telescopes are used extensively to locate, collate and catalogue the position, frequency emissions or wavelength, and velocity of radio sources. The intensity of the radio frequency energy that reaches Earth is small when compared to the optical frequency range, and thus a radio telescope must have a large surface area (antenna) in order to be useful [1]. By combining two or more radio telescopes, called arraying, a more detailed, higher resolution image can be obtained to more precisely pin-point the source of radiation. This process is heavily reliant on the technique of Radio Interferometry: ascertaining the properties of multiple waves by the superposition of their interference patterns [2].

Before cosmic sources can be recognized, interference must be isolated and extracted from the raw radio images. There exist many sources of interference in the radio frequency spectrum, called 'noise', which further complicates the process of source finding. It is often difficult to distinguish an object's emissions from extraneous emissions produced by nearby sources [1]. Identifying, cataloguing and ultimately classifying cosmic sources form the basis of much scientific research in Radio Astronomy today. A large proportion of scientific work done on radio images is not done directly on the images themselves, but on the catalogues of sources and the output they produce [3]. The cataloguing of large continuum and spectral line surveys of extragalactic galaxies and celestial objects is often beyond the scope of manual inspection, and this has driven the development of a multitude of astronomical software platforms. The source extraction problem is thus a well-documented procedure with abundant theoretical analysis of both the noise removal and source finding approaches. The remainder of this chapter decomposes and critically assesses the dominant techniques presented in the literature, provides a comprehensive review of the currently available source extraction software packages, describes the FITS file format, and introduces GPU-specific terminology used throughout this paper.


If no assumption about the shape of the sources is to be made, traditional image segmentation techniques can be applied to the raw image to identify objects. Within image processing, an image object (or blob) can be classified as a discrete identifiable portion of an image that can be interpreted as a single unit. Described below are several mechanisms for discerning objects within an image from background and noise by locating and merging contiguous sets of voxels (volumetric pixels) based on pre-defined parametric criteria.

Thresholding, in the field of image processing, is the technique of distinguishing between object 'voxels' and background 'voxels' based on a pre-computed threshold parameter. The threshold acts as a cut-off value which determines whether a voxel belongs to an object or to the background. Defining a threshold parameter is not trivial, as any selected value will result in some sources being discarded and some false positives incorporated. In source detection, setting the threshold too high results in many real sources being missed, while setting it too low may detect noise as spurious sources. Due to the conservative nature of radio source catalogues, a high threshold (typically between 5σ and 7σ) is applied, which aims to minimize the likelihood of obtaining spurious detections.

The technique of False Discovery Rate (FDR) is a statistical procedure that explicitly controls the fraction of false positives when performing multiple hypothesis testing [4] during threshold determination. FDR works from the assumption that the image has been normalized. Briefly, normalization in this context is the process of fitting a Gaussian to pixel histograms in image segments of a user-specified size. The mean calculated from a segment is subtracted from each pixel within it, and the result is divided by the standard deviation of that segment, creating uniform noise characteristics throughout the image defined by a Gaussian with zero mean and unit standard deviation [4]. The null hypothesis is that each pixel is drawn from this Gaussian distribution. Each pixel is then assigned a probability p (which relates to that pixel's now-normalized intensity) of being drawn from such a distribution. Under the assumption of no present sources, a low probability value acts as an identifier for potential source pixels. A source pixel is a pixel that lies above the threshold and is therefore assumed to be part of a real source. A source is then defined as a contiguous blob of source pixels that translates directly into an astronomical object.

SFIND 2.0 from the MIRIAD software suite [16] is an iterative peak detection (or hill climbing) source extraction function that locates the nearest local maximum amongst the contiguous source pixels from the current source pixel. A collection of contiguous, monotonically decreasing pixels is selected from this peak, with pixels that have an FDR p-value below the threshold being discarded. The pixels that remain are marked as belonging to a source to prevent them from being reinvestigated in a later iteration.


When compared with algorithms that connect voxels above a given threshold into islands, peak detection has the advantage of being able to distinguish between two closely spaced objects. However, this method of detection is more appropriate for stars and is not particularly well suited to the detection of low surface brightness objects or extended sources that appear to be two closely spaced objects [5]. Often, a second pass through the data is needed to 'glue' together closely located low differential detections. Other examples of thresholding techniques include a simple nσ threshold, called sigma clipping [18], which defines the threshold as some multiple n of standard deviations from the image noise level or mean, or a manually specified threshold at an arbitrary flux value (e.g. 5 mJy). Further readings and example implementations of the peak detection algorithm can be found in [5] and [6]. Applications of various thresholding techniques are explored in detail in [3] and [7].

Because traditional thresholding techniques are insensitive to objects in close proximity to one another [3], objects that overlap at similar contours of luminance (isophotes) are incorrectly coalesced into a single object, and it is often necessary to split the contributing sources into distinct objects. The technique of splitting objects into their components is called de-blending and requires a re-examination of each object for further analysis. Each object is analysed by applying a linearly (or otherwise) spaced set of thresholds from the detection threshold, the lowest intensity voxels, up to a fixed fraction of its maximum flux (its intensity peak). At each isophote, if the object decomposes into more than one component, the component containing the maximum flux retains its original identification whilst the others are separated as independent objects [8]. This process repeats on the original object until subcomponents are smaller than a specified minimum voxel size. The divorced objects are appended to the list of extracted sources with their isophote set to the contour (or threshold level) at which they were split.
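As a concrete illustration, the following is a minimal C++ sketch of the nσ sigma-clipping threshold described above; the function names are illustrative and not taken from any of the packages discussed.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Compute an n-sigma clipping threshold: n standard deviations above the
// image mean. A production implementation would typically iterate,
// re-estimating the statistics after clipping outliers.
float sigmaClipThreshold(const std::vector<float>& voxels, float n)
{
    double mean = std::accumulate(voxels.begin(), voxels.end(), 0.0)
                  / voxels.size();
    double var = 0.0;
    for (float v : voxels) var += (v - mean) * (v - mean);
    double sigma = std::sqrt(var / voxels.size());
    return static_cast<float>(mean + n * sigma);
}

// A voxel is treated as a source voxel if its flux exceeds the threshold.
bool isSourceVoxel(float flux, float threshold) { return flux > threshold; }
```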

The profile of a composite object can be viewed as a tree of branches, one per threshold level. At each threshold level, a decision is made, based on the relative integrated intensity of each branch, whether to regard the branch as a separate, distinct object: if at least two branches each have an integrated intensity greater than a pre-defined fraction of the total intensity of the composite object, the branches are split and appended to the list of sources (the catalogue).


Once objects have been detached, the underlying pixels below the separation threshold must be reallocated to the object to which each most likely belongs. This is calculated from the intensity contribution of each pixel to each object, represented as a probability that determines to which object the pixel belongs. SExtractor applies a bivariate Gaussian fit to each pixel's profile to compute its probability of belonging to each object [3]. Variations of this technique appear in the spacing of threshold levels (linear, exponential or fixed), the decision conditions and the size of the smallest approved object (in terms of contiguous voxels). These techniques are not mutually exclusive, and software packages such as Duchamp and SExtractor both provide optimised variants. The various source extraction functions present in AIPS and MIRIAD are summarised in Table 1.

AIPS             MIRIAD      Description
IMFIT/JMFIT      IMFIT       Fits multiple Gaussians to all pixels in a defined blob
SAD/VSAD/HAPPY   IMSAD       Defines 'islands' of pixels above a user-defined threshold and then fits multiple Gaussians within these 'islands'
(none)           SFIND 2.0   Defines a threshold using FDR to determine pixels that belong to sources and fits a Gaussian to those pixels

Table 1: Existing radio source detection modules

It should be noted that SFIND 2.0, IMFIT and IMSAD use the same elliptical Gaussian fitting routine to measure identified sources. A series of tests was conducted to evaluate the accuracy of the algorithms in Table 1, where accuracy is defined by the percentage of true sources successfully detected as well as the level of spurious sources incorrectly detected. The findings of these tests, administered on both synthesized and real radio images, concluded that SExtractor misses significantly more true sources when the methods are constrained to the same level of false detections [4]. The conservative yet accurate FDR threshold approach yielded similar results when compared with IMSAD and IMFIT. It is more desirable to incur the fault of missing true sources than to accept spurious detections, as the overhead involved in determining the validity of a generated catalogue can then be greatly reduced [3].

Once a threshold is applied, and object and background voxels segmented, voxels need to be clustered into complex structures. Connected component labelling (CCL) is a process that directly succeeds segmentation and is applied to efficiently analyse and construct these objects so as to allow for further processing. Neighbouring source voxels that share the same state are labelled as the same object. There are several kinds of component labellers, differing in their approach to label propagation, the storage of label equivalence information and the number of iterations required for label finalization. Connected component labelling was selected as the appropriate technique for source construction as it is a simple and efficient approach to traversing and combining data points into multi-dimensional objects. Several variants of CCL lend themselves towards parallelization, and as such the technique was selected for GPU appropriation. In general, CCL algorithms can be classified as multi-pass, two-pass or one-pass [22].

Multi-pass algorithms operate by labelling connected components over a multitude of iterations. Typically these algorithms perform local neighbour operations in each pass. These operations usually take the form of locating the neighbour in the scan mask with the lowest label, or recording the equivalence between two labels. The performance of multi-pass algorithms is highly dependent on the number of iterations, although some implementations have been shown to be highly effective [23].

Two-pass algorithms detect equivalences by resolving equivalence chains and then assigning a final label to each voxel. Two-pass algorithms typically include a scanning phase during which a pass is made through the data, examining each voxel's neighbours and recording equivalences between labels. The analysis phase resolves equivalence chains by traversing sequences of equivalent labels to locate the final correct label. The labelling phase requires the second pass through the data, during which the final label for each voxel is retrieved and set. Two-pass algorithms generally perform well as they only traverse the data twice. Performance is impacted by the efficiency of the analysis phase (label resolution) and the data structure used to store and resolve equivalences.

One-pass algorithms usually entail a recursive analysis of unlabelled voxels. These algorithms traverse the data until an unlabelled voxel is encountered. A new label is assigned to this voxel and to all connected voxels that share the same state. This procedure is often performed recursively and as such results in irregular data access patterns, significantly reducing performance.

An optimal approach to label propagation is through the use of label equivalence. This way, labels can spread not only to local neighbours but to all labels that have been equated. To store these equivalences and eventually resolve them, a data structure is needed to describe and track these relationships. The Union-Find data structure is an array of rooted trees where the root represents the minimum label of a set of provisional labels. It is comprised of three dominant operations: find, union and set. The find operation determines the root label that represents a given label. The union operation merges two trees, usually appending the shorter tree onto the larger, equating the labels of both trees. Set simply creates a new rooted tree with the new label as the root.

The Union-Find disjoint set data structure is well suited to the problem of label resolution; however, typical implementations utilize software pointers to create and modify the trees according to label equivalences. Accessing memory in this fashion is slow, as the pointers often point to randomly distributed locations in memory. An optimised alternative to software pointers, one that eliminates the impact of random memory accesses, is to flatten the trees into a single vector. A vector usually resides in a contiguous block of memory and as such offers more predictable memory access.
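The sketch below shows, in C++, what the flattened (vector-based) Union-Find described above might look like; the class and method names are illustrative. The minimum label in a set is kept as the root, consistent with the convention used in the labelling routines later in this report.

```cpp
#include <vector>

// Flattened Union-Find: parent links are stored in a contiguous vector
// rather than as software pointers, giving predictable memory access.
class UnionFind {
    std::vector<int> parent;  // parent[l] == l means l is a root
public:
    // set: create a new provisional label that is its own root.
    int set() {
        parent.push_back(static_cast<int>(parent.size()));
        return parent.back();
    }

    // find: follow the chain to the root, halving the path as we go
    // (a simple form of path compression).
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // union: equate two labels; the smaller root becomes the representative.
    void unite(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra < rb) parent[rb] = ra;
        else if (rb < ra) parent[ra] = rb;
    }
};
```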


2.1 Radio Interference

Complex imaging techniques such as wavelet de-convolution and signal self-calibration [9, 10] are applied in unison to significantly reduce or eliminate noise from radio images. The iterative process of image de-convolution and self-calibration in computer vision is called hybrid mapping [11]. It should be noted that variants and optimizations of the above imaging techniques exist and have recently been revisited for use in massively parallel computer systems [12]. Further details on handling radio interference are discussed in [31].

2.2 Source Detection Software

There exist several astronomical data analysis, image processing and source identification packages in mainstream use today. Listed below are the most commonly cited packages in the literature [3].

SExtractor builds a catalogue of cosmic objects extracted from an astronomical image; it detects, de-blends, measures and classifies sources through the underlying use of a basic neural network (NN) implementation. SExtractor makes use of the well-documented back-propagation learning principle [3] for source analysis and classification. The process can be described by several distinct steps: estimation of the image background noise, thresholding, de-blending, detection filtration, object classification (photometry) and finally star/galaxy separation [3].

AIPS++ (Astronomical Image Processing System++), the successor of AIPS [13], was an astronomical processing software package that provided image calibration, editing, enhancement and analysis functionality [14]. With a tool-based approach, modules could be modified and programmed by the user, appropriating the software for a specialized purpose. AIPS++ emphasized a distributed computing approach, designed to exploit massively parallel architectures and cutting edge computing technologies. AIPS++ was maintained by the AIPS++ Consortium, consisting of several institutions that operate major astronomical facilities, before being re-born as CASA (Common Astronomy Software Applications). CASA is now developed by the US National Radio Astronomy Observatory (NRAO) and is a suite of C++ libraries derived from the core set of AIPS++ tasks [15]. These libraries are wrapped in a Python scripting interface for dynamic end-user data processing and analysis.

MIRIAD (Multichannel Image Reconstruction, Image Analysis and Display) is a radio interferometry data reduction software package, utilized by the Australia Telescope Compact Array (ATCA) and designed by the Berkeley-Illinois-Maryland Association, for the reduction of both continuum and spectral line observations from raw data through to image analysis [16]. MIRIAD supports a host of deconvolution techniques (variants of CLEAN and MEM) and several source fitting algorithms (MAXFIT, IMFIT and IMPOS) which fit Gaussians to a region of an image to identify objects [16]. MIRIAD synthesizes, analyses and outputs images of publication quality [17]. MIRIAD was developed as a flexible alternative to AIPS, designed with a programmer-friendly environment for simple extension and customization.


Duchamp is a three dimensional radio source finder; it includes optional noise reduction techniques (through the use of wavelet reconstruction) prior to searching. It allows for both spectral and spatial smoothing to reduce the noise and enhance the features in the radio image. Duchamp is dynamic in that it allows the user to specify parameters such as signal-to-noise threshold and the minimum pixel size of detections. It further supports several output formats and uses the industry standard FITS data format (amongst others) as input. Duchamp is the most recent source extraction software suite currently in use and its development remains highly active. It is for these reasons as well as the flexibility of Duchamp that we have utilized it as a performance benchmark.

2.3 Astronomical Data FITS Format

The Flexible Image Transport System (FITS) is a file format commonly used in astronomy to store, transmit and manipulate scientific and other images. FITS is designed specifically for scientific data, provisioning for detailed metadata capable of describing photometric and spatial calibration information as well as the origin of the image. A FITS file consists of one or more Header Data Units (HDUs), each comprising a human-readable ASCII header followed by a data block. Information such as size, origin, coordinates, binary data format, comments and data history is commonly stored in the header.

2.4 Graphics Processing Unit Architecture and Terminology

In this section a brief summary of the GPU-related terms used throughout this paper is presented. Graphics Processing Units (GPUs) utilize the Single Instruction Multiple Threads (SIMT) parallel programming paradigm to provide high computational throughput due to their many-core design, encompassing a large number of processing cores. NVIDIA GPUs contain streaming multiprocessors (SMs), each of which contains thirty-two CUDA cores and associated memory. Cores within an SM must execute the same instruction synchronously, whilst SMs can execute different functions, called kernels, independently. To efficiently utilize the GPU's processing capabilities, a large number of threads must execute concurrently. The GPU avoids high thread management overhead by implementing thread management and scheduling in hardware. Threads are organised into 3D structures called blocks, and blocks are tiled in 2D structures called grids. CUDA enabled devices differ in the size of the blocks and grids they permit. The current Fermi architecture imposes a block size limit of 1024 threads, with maximum sizes of {1024, 1024, 64} in each dimension, coupled with a grid size limit of {65535, 65535}. Threads are executed by being assigned to SMs; the SM splits the thread block into sets of 32 threads known as warps. Each thread is described by a unique identifier, from which its position within its block and within the grid can be calculated. NVIDIA's GPUs provide a memory hierarchy that includes several types of device memory:

Global memory is the main device memory on the GPU and is the most abundant. Any input/output to or from the GPU must be placed in global memory. Global memory has the slowest access time but is accessible to any executing thread. The number of global memory transactions can be reduced through a technique called memory coalescing, which is based on the alignment of memory accesses between sequential threads.

Shared memory is stored within SMs and is used to facilitate communication between threads in a thread block. Shared memory is utilized to reduce the number of global memory transactions by acting as a memory access cache between threads.

Constant memory is a cached method for accessing specific pieces of global memory. Constant memory allows multiple threads to simultaneously read the same value out of cache at once.

Texture memory is another cached method that allows multi-dimensional textures to be cached, exploiting the spatial locality of memory accesses on these textures to improve performance.
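To make the indexing concrete, the following minimal CUDA kernel computes a unique global index for each thread from its block and thread coordinates and processes one voxel; because consecutive threads touch consecutive addresses, the global memory accesses coalesce. The kernel and its launch parameters are illustrative.

```cuda
// Each thread scales one voxel of a flat array.
__global__ void scaleVoxels(float* data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (idx < n)           // guard threads that fall past the end of the data
        data[idx] *= factor;
}

// Host-side launch: one thread per voxel, 256 threads per block.
// scaleVoxels<<<(n + 255) / 256, 256>>>(devData, n, 2.0f);
```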

Figure 1: CUDA processing flow. The data is first copied into the device's memory and the instructions are deployed onto the device. Kernels are launched to execute the instructions simultaneously on thousands of threads. Finally, the results are transferred back to the host (CPU) upon completion. Source: [28]

NVIDIA’s current generation GPU architecture family, Fermi [28], provides improvements over the previous generation in terms of accelerated double precision support, atomic operations and an increased shared memory size for thread blocks.


2.5 General Purpose Computing on Graphics Processing Units

General Purpose computing on Graphics Processing Units (GPGPU) is the technique of using a GPU for applications and computations typically handled by a Central Processing Unit (CPU). The growing popularity of GPGPU can be attributed to the escalating memory bandwidth and computational horsepower that current generation GPUs provide. GPGPU is possible due to innovations in the traditional graphics pipeline that accommodate flexible (user programmable) vertex and fragment processing [19]. This flexibility allows software engineers to apply stream processing to non-graphical data, offering order-of-magnitude performance improvements for arithmetically intense operations. These attributes are particularly well suited to image processing algorithms, such as convolution, edge detection [21] and image registration, which map well onto the GPU's parallel stream processing model, exhibit high data-level parallelism and suffer from high computational costs [20].

Figure: Average execution time speedup over CPU execution time for image registration. Source: [20]

In this figure, the performance of both a GPU and a CPU solution was analysed against increasing image sizes for an implementation of image registration; the speedup achieved on the GPU was at worst a factor of three over the CPU. Similar results were obtained for a GPU optimized convolution solution, which attained order-of-magnitude performance increases [21]. With the introduction of NVIDIA's CUDA, programmers are no longer constrained in GPGPU to shader languages such as Cg or GLSL. Instead, CUDA is a natural extension of the C/C++ programming environment, with common programming constructs such as arrays and pointers, its own compiler and extensive online documentation.


Chapter 3

Design

The purpose of this project is to accelerate the Duchamp source extraction procedure through optimisation and algorithmic redesign, such that the routines map well onto the GPU's architecture. This report focuses on the source detection family of routines that construct complex 3-dimensional sources from regions of high flux. The image is assumed to have been segmented prior to processing, and must already have undergone de-noising [31].

Source extraction is the final stage in the Radio Astronomy image processing pipeline, where object voxels (voxels with a flux value greater than the applied threshold) are aggregated into 3-dimensional objects. Properties are then calculated and recorded for these objects in order to assist astronomers in locating extra-galactic sources in the image for further analysis. The main challenges in source extraction are reducing the computational cost incurred by processing large datasets and managing the data in memory effectively.

We opted to remodel and optimise the critical path of the Duchamp algorithm (its core functionality) in a modular framework, with clearly defined interfaces and packages. This modularity is essential to the profiling of each routine against its GPU implementation, allowing component substitution and scope for future work. Thus the routines selected for the task of source extraction were chosen to maintain consistency with Duchamp. As described in our risk mitigation strategy, small iterative and incremental development cycles were adhered to, in order to avoid deviating from the deliverable timeline and to manage project scope through recurring feedback. In the following sections we present our design objectives, provide brief overviews of the components that constitute the entire Duchamp system and describe the source detection package.


3.1 Design Objectives

The aims of our designs were drawn through the identification and prioritization of constraints and their influence on system performance. The CPU and GPU implementations do not share common costs for memory operations and execution throughput, and as such separate measures were designed to accentuate these differences. However, the objective of both implementations is to satisfy the following design goals:

Optimisation Approach

The source detection image-processing routine is considered efficient if it minimizes its execution time whilst retaining a near-linear relationship between its memory footprint and the size of the dataset. The Duchamp source detection routines are optimised for both hardware and software through the implementation of algorithms tailored and tuned for the underlying hardware platform and the constraints it imposes on performance. A significant speedup factor is considered to be 5 or more times quicker than the original Duchamp derivation.

Result Integrity

The implementations must be consistent with and produce results that accurately resemble those obtained from Duchamp. Integrity is evaluated through the use of sample synthesized/real datasets and the rate of successful and spurious detections (the margin of error) produced. Section 5.1 describes the methodology used to determine these rates and to quantify system performance.

System Modularity

An object-oriented programming approach logically modularises each system component into functionally isolated routines. Modules are organised in this manner to create an elegant and robust solution that improves system simplicity, legibility and testability. Relationships between modules are clearly defined interfaces that further decouple inter-module dependence.

Application Programming Interface (API)

The structure of the system (Figure 2) anticipates the need for a highly structured, pre-defined interface between each project member and, as such, each system component. Each implementation includes a native API to induce module independence. This decoupling demands an extensive and capable interface that conducts integration into the 'sandbox' test framework and between other system components.


3.2 System Overview

The Duchamp algorithm contains the following procedures:

- Image alteration: encompasses the image modification and reconstruction routines.
- I/O: parses the user input parameters via the command line and configuration file, and handles the output of catalogue files, i.e. the list of detections.
- Source extraction: extracts sources from the processed image.
  - FDR threshold: calculates noise statistics of the cube to segment the image into object and background voxels.
  - Source detection: aggregates object voxels into objects, then merges and rejects objects to represent true sources.
- cfitsio wrapper: wraps the cfitsio library, handling the input and output of astronomical images, i.e. the FITS files.

Figure 2: Overview of the components within the system

The source detection component can be further divided into a number of sub-components, where each sub-component represents a distinct family of routines:

- Connected component labelling: sets of adjacent pixels are combined into an object, based on defined heuristics.
- Object merging: objects that are in user-defined proximity (spatially or spectrally) are combined into sources.
- Source validation: the source list is pruned based on user-defined parameters.


3.3 Source Detection

Routines and algorithms within Duchamp are selected for optimisation based on their innate data parallelism and the significance of their contribution to overall computation time. The connected component labelling and object merging routines were selected for optimisation because they constitute a significant proportion of Duchamp's computation time and there was sufficient potential for parallelisation. This potential is due to the algorithms' inherent single instruction multiple data (SIMD) nature.

Figure 3: 2-Phase design strategy

We treat Duchamp as a naive CPU implementation and, in two phases, iteratively optimise these components such that computation time is reduced whilst maintaining accuracy. The first phase entails an optimal CPU rewrite of the source detection routines as well as general framework development. The second phase appropriates and deploys the computationally expensive routines to the GPU utilizing NVIDIA’s CUDA. Three stages of data processing have been identified that collectively form an effective pipeline for extra-galactic source detection. The connected component labelling routine creates distinct objects based on the adjacency of contiguous object voxels. The CPU implementation constructs preliminary two-dimensional objects that are then combined to form three-dimensional objects. The GPU implementation naturally extends to the third dimension and as such generates three-dimensional objects. The object merging CPU routine joins two-dimensional objects that are in immediate proximity, forming three-dimensional objects. The source validation routine segregates detected sources based on user-defined specifications, rejecting spurious detections.


Figure 4: Functional illustration of the source detection pipeline and the output of each sub-routine

Connected Component Labelling (CCL)

This simple image processing technique aggregates contiguous object pixels into elementary objects. The CPU implementation raster-scans the image to generate provisional labels for subsets of connected components (adjacent object pixels), such that the labels assigned are equivalent. These equivalent labels are resolved and then applied on the second pass through the image, adjusting provisional labels into the designated representative label. This routine accepts a segmented 2-dimensional FITS data channel (or slice) as input and outputs a list of objects that contain information about the pixels each encompasses. Segmentation must result in exhaustive and mutually exclusive sets of object and background pixels.

The GPU implementation takes a similar label equivalence approach, first described by [23], separating the scan, analysis and labelling phases. A kernel is launched for each phase, with a thread executing for each voxel. Each thread identifies the adjacent voxel with the lowest label and the relationship is stored. The subsequent kernel resolves label equivalences based on the information generated in the scan kernel. Finally, the labelling kernel updates the appropriate label for each voxel. This routine diverges from the CPU implementation as it makes no attempt to fragment the problem set into chunks, accepting an entire segmented 3-dimensional FITS data cube as input.

CCL CPU optimisation

Several optimisation strategies that were not explicitly considered by Duchamp create vast potential for optimising this routine. An optimal object detection routine can be attained by minimizing the occurrences of label equivalence resolution, an expensive sub-routine that merges two (or more) sets of provisional labels under a representative label, and by reducing the number of comparisons made within the scan mask, avoiding expensive comparisons and resulting in predictable memory access patterns.

CCL GPU optimisation

Due to time constraints, no GPU-specific optimisations were achieved. However, comparisons in the scan mask were reduced, and an efficient mechanism for detecting and avoiding edge cases was implemented, reducing the number of inactive (idle) threads. This algorithm lends itself to parallelization as it presents an efficient mechanism for creating regions in binary images based on the adjacent local neighbourhood of each voxel; as such, operations on each voxel are almost independent. Additionally, this approach makes no initial assumption as to the shape of detections, so as to robustly handle both point and extended sources.

Object Merging

This procedure amalgamates two-dimensional objects that are within the user-defined spatial or spectral proximity of each other into three-dimensional sources. This routine accepts a 2-dimensional list of objects as input and returns a list of 3-dimensional sources as output. As each channel may include numerous objects, a list of detected objects is maintained per channel.

Object Merging CPU optimisation

By using the spatial and spectral intervals of the objects, interval overlaps can be determined efficiently without the need to traverse an object's entire body of voxels. This routine is often necessary to assemble extended source structures that span a multitude of channels which may not be adjacent.

Object Merging GPU optimisation

Due to time constraints, no GPU object merging routine was completed.


Source validation

Properties of each detected source are analysed to remove spurious detections (such as single-channel spikes) or sources that fail to meet the criteria specified by the user. Sources must span a minimum number of consecutive channels and include a minimum set of voxels to be accepted as a true source. This routine accepts a list of sources as input and returns a validated (often truncated) list of true sources. Validation ensures a consistent and concise set of results whose accuracy can be adjusted via user-defined input parameters (see Appendix A). This allows the user to define detection constraints and engage actively with the result set to fine-tune the system to his or her required specifications.


Chapter 4

Implementation

The implementation details are discussed in this chapter. Section 4.1 details the libraries, language and platform used during development. Section 4.2 provides implementation details for both of the proposed design phases. Section 4.3 identifies the limits, constraints and shortcomings of the source extraction implementations.

4.1 Development Details

Linux
Due to the existing supporting open-source libraries on the platform, Ubuntu Linux was selected as the development platform.

C++
To ensure interoperability with CUDA and between system components, and because efficient access to memory and the CPU was necessary for several system components, C++ was used as the programming language throughout the system.

CUDA
NVIDIA's Compute Unified Device Architecture provides developers with direct access to the architecture of the GPU, such as shared memory (see Section 2.4), as well as both a high and a low level API for executing procedures on the GPU. CUDA is effectively deployed in C syntax with NVIDIA extensions and was utilized for its reduced learning curve and performance enhancements over other hardware accelerated languages.

CFITSIO
The FITS format (see Section 2.3) is an archaic and complex file format designed to service a broad set of functionality for large data sets across scientific disciplines. The open source CFITSIO v3 library provides C programmers with a powerful yet simple interface for accessing FITS files.

TCLAP
An open source command line parser was integrated to allow all input parameters to be specified via the command line. The Templatized Command Line Argument Parser (TCLAP) is a simple library that independently parses the command line and conforms to POSIX standards.

Libconfig
An open source configuration file parser was integrated to allow all input parameters to persist between executions. Libconfig is a simple library for processing structured and compact configuration files.
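As an illustration of the CFITSIO interface, the sketch below reads a 3-dimensional data cube into a flat array; error handling is abbreviated and the function name is our own.

```cpp
#include <fitsio.h>   // CFITSIO v3
#include <vector>

// Read an entire 3D FITS image into memory. On return, naxes holds the
// x, y and channel extents. CFITSIO reports errors through 'status'.
std::vector<float> readCube(const char* path, long naxes[3])
{
    fitsfile* fptr = nullptr;
    int status = 0;
    fits_open_file(&fptr, path, READONLY, &status);
    fits_get_img_size(fptr, 3, naxes, &status);

    std::vector<float> data(static_cast<size_t>(naxes[0]) * naxes[1] * naxes[2]);
    long firstPixel[3] = {1, 1, 1};          // FITS pixel indices are 1-based
    fits_read_pix(fptr, TFLOAT, firstPixel, static_cast<LONGLONG>(data.size()),
                  nullptr, data.data(), nullptr, &status);
    fits_close_file(fptr, &status);
    return data;
}
```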


Licensing

All libraries used directly by the source extraction routines are licensed under the GNU GPL. This implies that any program is free to link against and use these libraries provided that it too is released under the GNU GPL. This source extraction program is therefore released under that license.

4.2 Two Phase Implementation

The implementation was split into two phases; this was necessary to monitor progress and ensure that the components used throughout the system were completed. The first phase encompassed the system's framework and the CPU implementation of source extraction. The second phase directly succeeded the CPU implementation and appropriated the computationally expensive routines to the GPU.

4.2.1 Phase 1: CPU Implementation

Configurator

The configurator parses the command line arguments and the configuration file and stores these as parameters that are passed throughout the system. Additionally, it validates the FITS data cube and stores the HDU header information present within the FITS file. The configuration file is parsed first and is then overridden by arguments specified on the command line. Parameters that are required to persist between executions are stored in the configuration file. The duality of the configuration file and the command line arguments provides an efficient mechanism for adjusting input parameters at run time, as needed.

Sources

Sources are stored as a vector of Boolean matrices, where the vector position indicates the channel and the position within the Boolean matrix the x- and y-coordinates respectively (Figure 5). The object maintains the channel offset to grow the vector proportionately with the source's channel span. The minimum and maximum x- and y-coordinates are stored for each channel. Each matrix is initialized to {30, 30} and is grown when a pixel does not fit within the specified domain.


Figure 5: Data structure used to house three-dimensional sources. A vector maintains an array of occupied channels: each channel contains a Boolean matrix (depicted here with black equating to true and white to false) which demonstrates the presence of a source voxel at that position.

Sources are grown by remapping the appropriate Boolean matrix into a larger Boolean matrix. The growth factor indicates the additional size (over the minimum required to contain the new pixel) to extend the matrix. This is intended to reduce the number of expensive matrix resizes throughout the lifespan of the source. It is possible that the addition of each new pixel can require a resize and as such a growth factor was introduced to diminish this occurrence. A source is measured by the number of voxels it contains, calculated by the summation of the pixel sets in each channel, and the number of channels (or channel span) it encompasses.
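The following C++ sketch outlines this data structure; member names are illustrative, and for brevity it assumes voxels arrive in non-decreasing channel order (as they do when channels are processed in sequence) and grows matrices to the exact required size rather than by the growth factor.

```cpp
#include <cstddef>
#include <vector>

struct Source {
    int channelOffset = 0;                              // first occupied channel
    std::vector<std::vector<std::vector<bool>>> masks;  // [channel][y][x]

    void addVoxel(int ch, int x, int y) {
        if (masks.empty()) channelOffset = ch;
        std::size_t c = ch - channelOffset;
        if (c >= masks.size()) masks.resize(c + 1);     // extend channel span
        auto& m = masks[c];
        if (m.size() <= static_cast<std::size_t>(y)) m.resize(y + 1);
        if (m[y].size() <= static_cast<std::size_t>(x)) m[y].resize(x + 1, false);
        m[y][x] = true;
    }

    std::size_t voxelCount() const {   // a source is measured by its voxel count
        std::size_t n = 0;
        for (const auto& m : masks)
            for (const auto& row : m)
                for (bool b : row) n += b;
        return n;
    }

    std::size_t channelSpan() const { return masks.size(); }
};
```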

Connected Component Labelling

We implemented the two-pass, label-equivalence based connected-component labelling algorithm proposed by [22], whereby the first scan assigns provisional labels and computes label equivalences such that all provisional labels that belong to a connected component at any point are combined into the same equivalence set and hold the same representative label. This algorithm is fundamentally different from Lutz's one-pass connected component labelling algorithm utilized in Duchamp [25]. All provisional labels in an equivalence set (which share a representative label) are said to be equivalent. This approach differs from conventional two-pass algorithms, which calculate label equivalence between scans, by resolving equivalences immediately. This avoids calculating the minimum provisional label present in the scan mask (Figure 6) before assigning the current pixel a label.


Figure 6: The scan mask for 8-connectivity. Source: [26]

We scan the pixels bottom-up, left to right. The algorithm naturally incorporates path folding, a CCL optimization technique, and eliminates the need to calculate the lowest provisional label present in the scan mask, as once a provisional label is assigned its equivalence is immediately calculated. It also retains a predictable data access pattern through 'raster scanning'. The three-dimensional data cube is decomposed into a series of two-dimensional channels, and we process these channels as two-dimensional adjacency graphs. The data structure used to store and resolve label equivalences is the Scan Plus Connection (SPC) table proposed by [24]. The SPC table is a one-dimensional vector that is as long as the number of provisional labels. When a new provisional label is appended, its value is set to its index. When a label is found to be equivalent to another, the higher provisional label is set to the value of the lower label.

Figure 7: The Scan Plus Connection (SPC) table proposed by [24]. Initially, each provisional label is set to its index in the table. When a connection is found between two provisional labels, the larger provisional label is resolved to the lower label. Source: [22]

Equivalence resolution is the process of traversing the path created by these equivalence chains until a label is reached that equals its index (it is the minimum label in the chain and points to itself). This label is then propagated up the path, setting each provisional label to this minimum label. The technique of setting each label along the path to the minimum label (or root) is known as path compression and reduces the cost of future label equivalence resolutions.
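Putting the pieces together, the sketch below labels a single W x H channel with the two-pass scheme, using a flattened equivalence table in the spirit of the SPC table: equivalences are resolved as soon as they are found, so the second pass only looks labels up. For brevity it uses 4-connectivity (the implementation above uses the 8-connectivity mask of Figure 6) and a top-down scan, and omits path compression.

```cpp
#include <vector>

std::vector<int> labelChannel(const std::vector<bool>& mask, int W, int H)
{
    std::vector<int> label(W * H, 0);   // 0 denotes background
    std::vector<int> eq{0};             // eq[l]: next label in l's chain

    auto resolve = [&](int l) {         // walk the chain to the root
        while (eq[l] != l) l = eq[l];
        return l;
    };
    auto merge = [&](int a, int b) {    // equate two chains; min label is root
        a = resolve(a); b = resolve(b);
        if (a < b) eq[b] = a; else if (b < a) eq[a] = b;
    };

    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int i = y * W + x;
            if (!mask[i]) continue;
            int left = (x > 0) ? label[i - 1] : 0;
            int up   = (y > 0) ? label[i - W] : 0;
            if (!left && !up) {                  // new provisional label
                eq.push_back(static_cast<int>(eq.size()));
                label[i] = static_cast<int>(eq.size()) - 1;
            } else if (left && up) {             // both neighbours labelled:
                label[i] = left;                 // take one, record equivalence
                merge(left, up);
            } else {
                label[i] = left ? left : up;
            }
        }

    for (int& l : label)                         // second pass: final labels
        if (l) l = resolve(l);
    return label;
}
```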

Object Merging

We implemented a naïve, computationally inexpensive interval intersection routine to merge non-adjacent sources. Two distinct phases of object merging ultimately result in source amalgamation: intra-channel merging and inter-channel merging.

Intra-channel merging replaces the nearest neighbour approach undertaken by Duchamp to determine the minimum distance between two objects. This is achieved by utilizing the intervals of each object to create a bounding box that encompasses its extremities. The bounding box is grown by the maximum spatial separation distance specified by the user at runtime and compared against other objects in the same channel for intersection. Two objects are combined if their bounding boxes intersect. The bounding box approach is inaccurate when compared to nearest neighbour calculations, being more susceptible to combining nearby sources, but still provides reasonable results (see Section 5.3): the average difference in the quantity of sources extracted is 7.9%, with a maximum differential of 12.81%.

Inter-channel merging is done after the processing of each channel. The set of two-dimensional objects extracted is compared to and assessed against the spatial and spectral proximity of three-dimensional sources already discovered. A two-dimensional object slice is integrated into an existing three-dimensional source if it is in the spatial proximity of at least one channel of the source that is also spectrally proximate. In other words, the channel of the two-dimensional object must be spectrally close to a channel of the source that it is in spatial proximity to. If the object satisfies these conditions, it is incorporated into the three-dimensional source.
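A minimal sketch of the intra-channel bounding-box test is shown below; the struct and field names are illustrative.

```cpp
struct Box { int xMin, xMax, yMin, yMax; };  // an object's spatial extremities

// Grow box 'a' by the user-specified separation 'sep' in every direction,
// then test for axis-aligned overlap with box 'b'. Overlapping objects
// are candidates for merging.
bool shouldMerge(const Box& a, const Box& b, int sep)
{
    return a.xMin - sep <= b.xMax && a.xMax + sep >= b.xMin &&
           a.yMin - sep <= b.yMax && a.yMax + sep >= b.yMin;
}
```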

Source Validation

We implemented a source validation routine that regulates the output based on the input parameters defined at run time (see Appendix A). This procedure iterates through each source, analysing its properties and characteristics, and purges sources that fail to meet the minimum criteria. Examples of these criteria are the minimum number of voxels an object must consist of to be considered a true source, as well as the minimum number of channels the object must be present in. Sources are removed if they do not meet all the stipulated criteria; only sources that persist after validation are reported.
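A sketch of this pruning step is shown below; it reuses the illustrative Source structure from the earlier sketch, and the parameter names are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Remove sources that contain too few voxels or span too few channels;
// only the sources that survive are reported.
void validate(std::vector<Source>& sources,
              std::size_t minVoxels, std::size_t minChannels)
{
    sources.erase(std::remove_if(sources.begin(), sources.end(),
                      [&](const Source& s) {
                          return s.voxelCount() < minVoxels ||
                                 s.channelSpan() < minChannels;
                      }),
                  sources.end());
}
```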


4.2.2 Phase 2: GPU Implementation

Connected Component Labelling

We implemented a multi-pass, label-equivalence connected component labelling algorithm described in [27], which is similar in approach to the CPU method defined in [24]. However, the algorithm diverges from the traditional two-pass approach by only recording and resolving fragments of the equivalence table within each iteration. Furthermore, the algorithm presented in [27] was extended to label three-dimensional objects. Each kernel launches a thread for every voxel. Four separate kernels were implemented to compute the stages of connected component labelling: the initialization kernel prepares the data and labelling arrays, the scanning kernel examines the local neighbourhood of each voxel, the resolution kernel resolves label equivalences stored in the previous iteration, and finally the labelling kernel assigns the resolved labels to each voxel.

Figure 8: A single iteration of the label equivalence propagation algorithm as described in [27]. From left to right: the image after initialization, the initial equivalence array, the equivalence array after the scanning kernel, the equivalence array after the resolution kernel and finally the resultant image after the labelling kernel.

The initialization kernel assigns a unique identifier to each object pixel and zero to each background pixel. A resolution array of the same length as the data is concurrently initialized to these same values. In the scanning kernel, each thread examines the voxels neighbouring its voxel in each direction for a label smaller than the current label. To account for the GPU's inability to handle thread divergence effectively, only face connectivity (the three-dimensional analogue of 4-connectivity) is assumed. If a neighbouring voxel has a lower label, the resolution array at the current position is set to this new lower label; the minimum value is set using a CUDA atomic operation. If a lower label was encountered during the scanning phase, a Boolean records the presence of a change. The resolution kernel analyses each position in the resolution array: if the value at this index differs from the current position's actual label, the thread resolves the equivalence chain until the index in the equivalence array equals its value, and the resolution array at the current position is then updated to this value.


The labelling kernel updates the label of each voxel by looking up the current voxel's label in the equivalence array. This implementation cannot label the entire image in a single pass and must execute several iterations until the scanning kernel reports no changes to the labels.
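To make the scanning stage concrete, the following is a minimal sketch, not the report's actual code. It assumes object voxels are initialized to label = linear index + 1 (background voxels to 0), so a label doubles as an index into an equivalence array of length N + 1, and it assumes face connectivity (six neighbours):

    // Hedged sketch of the scanning kernel; the array layout and kernel
    // signature are illustrative assumptions.
    __global__ void scanKernel(const int* labels, int* equiv,
                               int nx, int ny, int nz, int* changed)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nx * ny * nz || labels[idx] == 0) return;   // background voxel

        int x = idx % nx, y = (idx / nx) % ny, z = idx / (nx * ny);
        int best = labels[idx];

        // Only the six face-adjacent neighbours are examined, limiting
        // thread divergence as described above.
        const int dx[6] = {-1, 1,  0, 0,  0, 0};
        const int dy[6] = { 0, 0, -1, 1,  0, 0};
        const int dz[6] = { 0, 0,  0, 0, -1, 1};
        for (int n = 0; n < 6; ++n) {
            int X = x + dx[n], Y = y + dy[n], Z = z + dz[n];
            if (X < 0 || X >= nx || Y < 0 || Y >= ny || Z < 0 || Z >= nz)
                continue;
            int l = labels[X + Y * nx + Z * nx * ny];
            if (l != 0 && l < best) best = l;
        }
        if (best < labels[idx]) {
            // Record the equivalence atomically and flag another iteration.
            atomicMin(&equiv[labels[idx]], best);
            *changed = 1;
        }
    }

    // Host side (sketch): iterate scanning -> resolution -> labelling,
    // relaunching until the scanning kernel reports no further changes.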

Object Merging
Due to time constraints, no object merging implementation was completed for the GPU.

4.3 Limitations
Both implementations have distinct shortcomings, and both are highly sensitive to adjustments made to the input parameters. Parameters were therefore kept constant to maintain consistency between implementations.

The CPU implementation suffers from the sub-optimal data structure used to create and manage objects. This is exacerbated in the CPU solution because it creates vastly more objects, first in two dimensions and later merged into three dimensions. Moreover, the CPU merging algorithm differs substantially from the approach used in Duchamp, adversely affecting the results reported.

The GPU implementation is severely constrained by input size, which is limited to the size of the device's on-board memory. Unlike the CPU solution, no memory management algorithm was deployed to segment the data cube and process it in fragments, which directly caps the maximum input size that can be processed. The results obtained from the GPU solution differ because it excludes an object merging routine; however, the GPU innately creates three-dimensional sources, so the results do not diverge significantly. Finally, the GPU implementation does not adequately exploit the memory hierarchy of the GPU and so falls short of near-theoretical maximum performance.


Chapter 5

Results
This chapter presents the results for the source extraction schemes. Section 5.1 describes the approach for analysing the data. Section 5.2 details the datasets used during analysis. Section 5.3 presents and discusses the results obtained from this analysis, and the chapter is summarized in Section 5.4.

5.1 Experiment Design
Testing was conducted on an Intel Quad Core 2.66GHz machine with an NVIDIA GeForce GTX 470 GPU with 1.21GB of GDDR5 on-board memory, 8GB of DDR3 RAM, running 64-bit Ubuntu Linux 10.04. The source extraction schemes were tested against a set of synthesized FITS sky map images. To control for fluctuations in, and remove bias from, the underlying environment, the desktop manager and all non-critical processes were terminated. Tests were preceded by system reboots to ensure cold-start execution time was accounted for, and then run for 100 iterations. The best, worst and average cases were recorded. Duchamp was utilized as a benchmark for base timing and output validation, as it represents the most recent software suite used for 3D source extraction within radio astronomy [18].

The diversity of extra-galactic objects creates great difficulty when searching through the images, and true source properties must be defined up front. Source properties are defined by the user through several input parameters. These properties affect the number of sources extracted, making output validity dependent on the chosen parameters; adjusting these parameters appropriately is paramount to obtaining relevant and reliable results. The input parameters are tabularised in Appendix A.

Each implementation, as well as its functional constituents, is measured independently in terms of execution time. When reporting speedup, the average execution time and variance for each implementation are shown. Reporting is achieved through the gprof profiling tool as well as inline function timing.


5.2 Datasets
Five different FITS files were used in the experiments. The FITS data files were generated by Kurt van Heyden of the University of Cape Town's Astronomy Department using the SKA Simulated Skies (S3) Tools provided by the University of Oxford's SKADS program [29]. The properties of each image are summarized in Table 2.

Image Reference   Dimensions      Physical Size   Sources
IMG_1.fits        476x485x26      23MB            1
IMG_2.fits        1200x1200x20    219MB           15
IMG_3.fits        1800x1800x20    510MB           68
IMG_4.fits        2000x2000x20    600MB           27
IMG_5.fits        3596x2596x29    2GB             90

Table 2: FITS images used during testing, where Image Reference is the name used to refer to each image, Dimensions are the x-, y- and z-dimensions respectively, Physical Size is the reported space occupied on hard disk and Sources is the number of sources extracted by Duchamp.

5.3 Speedup Results
Two separate implementations were attempted and compared against Duchamp's source extraction routines with the aim of increasing the rate at which extra-galactic sources are extracted from large astronomical images. The CPU implementation is a memory-managed and algorithmically optimised implementation of Duchamp that procedurally locates points of interest in the image, creates initial two-dimensional objects, amalgamates these objects into three-dimensional objects, merges non-adjacent objects and finally rejects objects that do not satisfy the minimal true source requirements. The GPU implementation effectively locates points of interest and creates three-dimensional objects. These objects do not undergo a merging phase, due to time constraints, so the GPU tends to report a larger number of distinct sources. These sources then undergo pruning to ensure they meet the same minimum requirements as the CPU solution.


[Figure: line graph; x-axis: Voxels (millions); y-axis: Average Execution Time (s); series: Duchamp, CPU, GPU.]

Figure 9: The average execution time of both our implementations (CPU and GPU) relative to the size of the images examined. Duchamp's results are included as reference performance.

Favourable results were obtained for both implementations, each reducing the overall source extraction execution time (Figure 9). Our CPU implementation performs twice as fast as Duchamp in most cases, attaining a best-case speedup of 2.51 for the largest data set. Our GPU implementation accelerates the routine further, achieving an average speedup of 5.194 over Duchamp and performing 5.469 times faster for the second largest data set. The linearity of these results suggests a strong positive correlation between input size and execution time. Figure 11 decomposes execution time by function, detailing the contribution of each function to overall execution time and showing that the vast proportion of execution time is spent analysing the data.

The disproportionate spike that skews the Duchamp results is due to the nature of the initial data set. The image contained a single, greatly zoomed-in galaxy, resulting in an expensive merging routine that ultimately failed to discern the galaxy, instead extracting 262 discrete spurious sources. Duchamp, and consequently our implementations, are designed to extract point and extended extra-galactic sources. This data set is not representative of typical FITS data cubes, but was included to illustrate the impact of unsuitable data (as opposed to input size) on execution time. Conversely, the CPU implementation fails on this case for an unknown reason; its first point is occluded on the graph by the result of our GPU implementation. Additionally, our GPU implementation fails on the largest case, as the data set is too large to fit into device memory. Due to time constraints, no memory management solution was devised for the GPU, so our GPU implementation is constrained by the size of the GPU's device memory. Solutions to overcome this constraint are suggested in Section 6.1.


The GPU implementation is a multi-pass CCL algorithm whose performance is directly impacted by the number of iterations required to label the image (in contrast to the CPU implementation, which requires only two passes). The number of iterations is data-set dependent; long, twisted components typically increase it [22]. Due to the small and sparse nature of the extra-galactic sources present in FITS data cubes, we found the number of iterations never exceeded four.

The relationship between the number of sources extracted and overall execution time is explored in Figure 10. A non-linear relationship is observable, probably due to the nature of these sources: their variation in size and spatial distribution. The CPU implementation reduces execution time in every case. The initial case has been omitted because no implementation, including Duchamp, achieves reliable results there. Our CPU solution reduces the processing time per pixel from 0.073 seconds in Duchamp to 0.014 seconds. This five times speedup can partially be attributed to Duchamp calculating and storing additional attributes per source (recording average, minimum and maximum flux per channel/object, for example) that are not explicitly required for source extraction.

The CPU algorithm utilizes a significantly different approach to object merging than Duchamp: the problem was reduced from a nearest neighbour search to one of interval intersection. This approach tends to merge objects more aggressively, often combining several distinct sources. However, Figure 10 illustrates that our implementation in fact extracts more sources than Duchamp. This is an artefact of erroneous testing that allowed Duchamp to re-segment already clean input images; the objects Duchamp detected were often smaller, incorrectly segmented to remove dim surrounding voxels, and were later rejected for failing the minimum source requirements.

[Figure: line graph; x-axis: Extracted Sources; y-axis: Average Execution Time (s); series: Duchamp, CPU, GPU.]

Figure 10: The average execution time of both our implementations (CPU and GPU) relative to the number of sources extracted. Duchamp's results are included as reference performance.


The GPU implementation appears unaffected by the number of sources extracted. The GPU solution did not utilize a merging routine and therefore only reports sources as sets of contiguous voxels. The merging routines (Figure 11) account for a substantial proportion of execution time and grow exponentially with the number of sources extracted.

Duchamp and our CPU implementation suffer the severe overhead of searching for non-adjacent but spatially proximate sources, which correlates their execution times with the number of sources. The GPU implementation thus reports a larger number of distinct sources in each case and exhibits no such relationship with the number of sources encountered.

[Figure: stacked chart; x-axis: Voxels (millions); y-axis: Percentage of Total Execution Time; components: CCL, Object creation, Object merging, Validation.]

Figure 11: Breakdown of the CPU implementation's execution time by function. Each family of related routines is represented by the summation of their execution times over input size.

A non-optimal implementation was used for the creation and storage of sources. Object maintenance (creation and updating) has a substantial impact on performance, consuming between 49% and 60% of execution time. This impact is limited by the fact that the majority of sources are small point sources. Adjusting the initial object map size and the initialization parameters for new objects, and incorporating a growth factor (the rate at which the map extends when a pixel falls outside its current range), reduced the number of object resizes by 318%, resulting in the performance displayed in Figure 9. The optimal object map configuration was defined by an initial map size of {30x30x1} and a growth factor of 1.2. Although this combination reduced overall execution time by minimizing the overhead of repeatedly growing each object, it consumed a larger amount of memory for storing objects. In larger images the impact was offset by the increased execution time spent in the CCL routines. It should be noted that the object merging routine has been reduced to a minor proportion of overall execution time and increases linearly with input size.
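The growth-factor mechanism can be sketched as follows. The ObjectMap type and its fields are illustrative assumptions (the report's actual data structure differs); the {30x30x1} initial size and the 1.2 growth factor are the values given above:

    #include <cstddef>
    #include <vector>

    // Illustrative growth policy: start with capacity for a 30x30x1 map and
    // grow geometrically by a factor of 1.2 whenever a voxel falls outside
    // the current capacity, amortising the cost of resizes.
    struct ObjectMap {
        std::vector<int> voxels;             // linear indices of member voxels
        std::size_t capacity = 30 * 30 * 1;  // initial map size {30x30x1}

        void add(int voxelIndex) {
            if (voxels.size() >= capacity) {
                capacity = static_cast<std::size_t>(capacity * 1.2);  // growth factor
                voxels.reserve(capacity);
            }
            voxels.push_back(voxelIndex);
        }
    };

Growing geometrically rather than per-voxel trades a modest amount of extra memory for far fewer reallocations, which is the trade-off described above.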


The validation routine consumes a negligible portion of execution time (3.4 × 10^-4 seconds), with the exception of the smallest data point, where it consumes 0.01 seconds. The GPU implementation's breakdown (Figure 12) illustrates a different scenario, whereby each kernel's proportion of execution time stabilises for the larger data sets. The GPU runs thousands of threads simultaneously; the initial data point does not saturate the device, resulting in many inactive threads.

[Figure: stacked chart; x-axis: Voxels (millions); y-axis: Percentage of Total Execution Time; kernels: Initialization, Scanning, Resolution, Labelling.]

Figure 12: Breakdown of GPU implementation's execution time by kernel over input size.

The scanning kernel accounts for the largest proportion of execution time. Analysing the neighbours of each voxel is an expensive procedure that does not map well to the GPU: each thread must access non-sequential positions in memory and must additionally handle edge cases at the image boundaries. To mitigate this, the GPU implementation assumes face connectivity (in contrast to the CPU's 8-connectivity), reducing the number of neighbours examined from 26 to 6.

Table 3 reports the results of our experiment. The Duchamp execution time was calculated by subtracting the time spent in extraneous functions (such as the statistical calculations preceding the processing of the cube and the determination of object flux density immediately afterwards) from the total execution time. The lowest execution time for each data set is marked with an asterisk; an 'x' denotes a failed case.

Image Ref.    Duchamp              CPU Implementation    GPU Implementation
              Mean      Std. dev.  Mean      Std. dev.   Mean       Std. dev.
IMG_1.fits    15.655    0.10381    x         x           0.2032*    0.004309
IMG_2.fits    1.414     0.090443   0.794     0.16271     0.311*     0.003213
IMG_3.fits    5.55      0.185876   2.607     0.118138    0.624*     0.009192
IMG_4.fits    4.042     0.16208    2.336     0.079331    0.7394*    0.008905
IMG_5.fits    12.471    0.183439   6.255*    0.226472    x          x

Table 3: Summary of results achieved. The mean is the average execution time in seconds over 100 executions. The best performing implementation for each data set is marked with an asterisk (*).


5.4 Summary
Two implementations were developed and analysed for performance gains against a Duchamp-derived baseline. Both successfully reduced the execution time of the Duchamp source extraction routines, with the CPU implementation running on average twice as fast and the GPU implementation attaining a noticeable five times speedup.

The CPU implementation of connected component labelling, based on the algorithm described in [26], is consistently faster than the Lutz implementation [25] utilized by Duchamp. Additionally, the memory management scheme deployed to import and analyse each channel reduces memory consumption, enabling significantly larger cubes to be processed. The success of the CPU implementation is hampered by our attempt to optimise the data structure used to record object positions, although it still achieves a notable speedup.

Duchamp's approach is to load the entire data cube into memory prior to processing. This makes computation dependent on the system's memory and virtual memory, limiting the effective maximum size of data cubes that can be processed on the underlying system. Our approach relaxes this constraint by loading channel segments sequentially and on demand, reducing the amount of data in memory at any given time and allowing much larger data cubes to be processed. The approximate upper bound changes from the size of the entire data cube to the size of any given channel segment.

The redesign of Duchamp's merging procedure significantly reduces the execution time spent forming sources. This is due to the computationally cheaper (but less accurate) approach to detecting neighbouring sources, which reduces the number of comparisons required between objects by avoiding an iterative nearest-neighbour check over every voxel of every source. This approach tends to merge distinct sources that overlap spectrally or lie in the vicinity of one another, failing to separate them into discrete entities.

The GPU implementation of connected component labelling, based on the algorithm described in [27], exploits the architecture of the GPU by dispatching a thread for each voxel. This approach increases the number of iterations required to label the data, but exploits the SIMT programming model of the GPU to gain a noticeable speedup. Our GPU solution fails to exploit the memory hierarchy of the device to attain the near-theoretical maximum computational horsepower of the GPU. Due to time constraints, the implementation makes use of global memory exclusively (see Section 2.4) where texture memory was recommended [27]. Additionally, the GPU only reports sources as contiguous voxel sets, resulting in a larger set of extracted sources. Due to extensive difficulties encountered utilizing NVIDIA's CUDA, no memory management scheme was implemented for the GPU. This places an explicit constraint on the maximum size of images that can be processed: the image must fit within the device's memory.
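For concreteness, the channel-wise streaming idea can be sketched as follows. The helper functions loadChannel, extractObjects and mergeIntoSources are hypothetical placeholders, not the report's API, and the separation thresholds are the values from Appendix A:

    #include <vector>

    struct Object;  // 2D object extracted from a single channel
    struct Source;  // 3D source built up across channels

    // Hypothetical helpers standing in for the actual routines.
    std::vector<float> loadChannel(const char* fitsFile, int z);
    std::vector<Object> extractObjects(const std::vector<float>& plane, int z);
    void mergeIntoSources(std::vector<Source>& sources,
                          const std::vector<Object>& objects,
                          int maxSpatial, int maxSpectral);

    void processCube(const char* fitsFile, int numChannels,
                     std::vector<Source>& sources)
    {
        // Only one channel resides in memory at a time, so the memory bound
        // is a single channel segment rather than the whole data cube.
        for (int z = 0; z < numChannels; ++z) {
            std::vector<float> plane = loadChannel(fitsFile, z);
            std::vector<Object> objects = extractObjects(plane, z);
            mergeIntoSources(sources, objects, /*maxSpatial=*/3, /*maxSpectral=*/5);
        }
    }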


Chapter 6

Conclusion
Two source extraction implementations were successfully developed and deployed on the CPU and GPU respectively. No hybrid solution was attempted. Both implementations execute faster than Duchamp (Section 5.3), reducing the overall execution time of the source extraction procedure. Our GPU implementation achieves a speedup factor of over 5 relative to Duchamp, meeting this project's key success criteria.

The GPU implementation is constrained by input size, as no working memory management scheme was produced to adequately partition the data. Additionally, no merging routine was offloaded to the GPU, due to time constraints and difficulties encountered with CUDA. This directly impacts the validity of the results obtained by the GPU implementation, with significantly more (in the worst case an additional 12%) spurious sources being extracted.

Our CPU solution reduces the processing time per pixel from 0.073 seconds in Duchamp to 0.014 seconds, a per-pixel speedup of roughly 5. However, the data structure designed to construct and store sources was sub-optimal, ultimately consuming the largest proportion of execution time. Accounting for this sub-optimality, the effective overall CPU speedup was twice that of Duchamp. Our CPU implementation utilizes a memory management scheme that reduces the memory requirement from the size of the entire data cube to the size of a single channel slice. Our solution therefore offers a robust, scalable approach to handling the exponentially increasing data set sizes that new telescopes will generate.

6.1 Future Work

GPU Memory Management
Fragmenting the data cubes into segments that map onto the GPU's on-board memory would allow the GPU solution to scale in a similar fashion to the CPU solution. Asynchronous memory transfers to and from the device can hide the latency of processing each segment in isolation; a sketch of this double-buffered approach is given at the end of this section.

GPU Texture Memory
Utilizing the texture memory cache available on the device would reduce the cost of memory accesses. Critically, the GPU implementation spends the largest proportion of its execution time in the scanning kernel, the routine that accesses neighbouring voxels in memory to seek label equivalences. Reducing the cost incurred during the execution of this kernel would drastically reduce the overall execution time of the GPU solution. Additionally, the use of shared memory within each thread block could resolve label equivalences within the block, in effect reducing the number of iterations required to label the entire image.
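The following is a hedged sketch of the double-buffered asynchronous transfer scheme suggested above; the segment sizes, the two-buffer choice and the processSegment kernel are illustrative placeholders, not a prescribed design:

    #include <cuda_runtime.h>

    __global__ void processSegment(float* seg, int n) { /* placeholder kernel */ }

    // Overlap host-to-device copies with kernel execution by ping-ponging
    // between two streams. For true overlap, hostCube should be allocated
    // as pinned memory (cudaMallocHost).
    void streamSegments(const float* hostCube, int numSegments, int segVoxels)
    {
        cudaStream_t streams[2];
        float* devSeg[2];
        for (int i = 0; i < 2; ++i) {
            cudaStreamCreate(&streams[i]);
            cudaMalloc(&devSeg[i], segVoxels * sizeof(float));
        }

        for (int s = 0; s < numSegments; ++s) {
            int buf = s % 2;  // alternate buffers so copy and compute overlap
            cudaMemcpyAsync(devSeg[buf], hostCube + (size_t)s * segVoxels,
                            segVoxels * sizeof(float),
                            cudaMemcpyHostToDevice, streams[buf]);
            processSegment<<<(segVoxels + 255) / 256, 256, 0, streams[buf]>>>(
                devSeg[buf], segVoxels);
        }
        cudaDeviceSynchronize();

        for (int i = 0; i < 2; ++i) {
            cudaFree(devSeg[i]);
            cudaStreamDestroy(streams[i]);
        }
    }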


Source Representation
A significant proportion of execution time is spent constructing and updating the internal data structure used to house each source's voxel set and properties. An investigation into an optimal data structure for the amalgamation and storage of three-dimensional objects would reduce the fraction of execution time spent on object creation and the overhead of managing these objects in memory.

Sensitive Merging
A crude approach to merging was implemented in an attempt to reduce the execution time spent comparing and joining sources. The merge routine is insensitive to the shape of irregular objects (such as those with an elongated side lobe or those with concavities). An optimised nearest neighbour search would increase the sensitivity of the merge operation, aligning its results more closely with those obtained from Duchamp.


References

[1] Miller, D. F., Basics of Radio Astronomy for the Goldstone-Apple Valley Radio Telescope, 1997. Retrieved April 12, 2010 from NASA Jet Propulsion Laboratory: http://www2.jpl.nasa.gov/radioastronomy/
[2] Thompson, A. R., Moran, J. M. & Swenson, G. W., Jr., Interferometry and Synthesis in Radio Astronomy, 2nd edition, Wiley & Sons, New York Press, 2001.
[3] Bertin, E. & Arnouts, S., SExtractor: Software for source extraction, Astronomy & Astrophysics Supplement Series, vol. 117, 393-404 (1996)
[4] Hopkins, A. M., Miller, C. J., Connolly, A. J., Genovese, C., Nichol, R. C. & Wasserman, L., A New Source Detection Algorithm using the False-Discovery Rate, The Astronomical Journal, vol. 123, 1086-1094 (2002)
[5] Yee, H. K. C., A Faint-Galaxy Photometry and Image-Analysis System, Astronomical Society of the Pacific, vol. 103, 396-411 (1991)
[6] Kron, R. G., Photometry of a complete sample of faint galaxies, Astrophysical Journal Supplement Series, vol. 43, 305-325 (1980)
[7] Jarvis, J. F. & Tyson, J. A., FOCAS - Faint Object Classification and Analysis System, Astronomical Journal, vol. 86, 476-495 (1981)
[8] Drory, N., Yet another object detection application (YODA): Object detection and photometry for multi-band imaging data, Astronomy & Astrophysics, vol. 397, 371-379 (2003)
[9] Sault, R. J. & Oosterloo, T. A., Imaging Algorithms in Radio Interferometry, Oxford University Press, England, 1996.
[10] Cornwell, T. & Wilkinson, P. N., A new method for making maps with unstable radio interferometers, Monthly Notices of the Royal Astronomical Society, vol. 196, 1067-1086 (1981)
[11] Cornwell, T., Very Long Baseline Interferometry and the VLBA, Astronomical Society of the Pacific, vol. 82, 39-56 (1995)
[12] Rau, U., Bhatnagar, S., Voronkov, M. A. & Cornwell, T. J., Advances in Calibration and Imaging Techniques in Radio Interferometry, Proceedings of the IEEE, vol. 97, 1472 (2009)
[13] Greisen, E. W., AIPS, the VLA, and the VLBA, Astrophysics and Space Science Library, vol. 285, 109-125 (2003)
[14] Norris, R. P., Very high angular resolution imaging, in Proceedings of the 158th International Astronomical Union (IAU) Symposium (Sydney, Australia, 1993), Kluwer Academic Publishers, 247
[15] Common Astronomy Software Applications: User Manual, 2010. Retrieved 16 April, 2010, from US National Astronomy Observatory: http://casa.nrao.edu/docs/userman/UserMan.html
[16] Multichannel Image Reconstruction, Image Analysis and Display: User Guide, 2009. Retrieved 14 April, 2010, from Australia Telescope Compact Array: http://www.atnf.csiro.au/computing/software/miriad/userguide/userhtml.html
[17] Sault, R. J., Teuben, P. J. & Wright, M. C. H., A Retrospective View of MIRIAD, in Astronomical Data Analysis Software and Systems IV (1995), ASP Conference Series vol. 77, 433-436
[18] Whiting, M., Source Detection with Duchamp: A User's Guide, 2010. Retrieved 14 April, 2010, from Australia Telescope National Facility CSIRO: http://www.atnf.csiro.au/people/Matthew.Whiting/Duchamp/
[19] Castaño-Díez, D., Moser, D., Schoenegger, A., Pruggnaller, S. & Frangakis, A. S., Performance evaluation of image processing algorithms on the GPU, Journal of Structural Biology, vol. 164, 153-160 (2008)
[20] Bui, P. & Brockman, J., Performance analysis of accelerated image registration using GPGPU, in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units (Washington, D.C., 2009), GPGPU-2, vol. 383, New York, 38-45
[21] Podlozhnyuk, V., Image Convolution with CUDA, 2007. Retrieved 02 May, 2010, from NVIDIA Corporation: http://developer.download.nvidia.com/compute/cuda/sdk/website/project/convolutionSeparable/doc/convolutionSeparable.pdf
[22] Wu, K., Otoo, E. & Suzuki, K., Optimizing two-pass connected component labelling algorithms, Pattern Analysis and Applications, vol. 12, 117-135 (2009)
[23] Suzuki, K., Horiba, I. & Sugie, N., Fast connected-component labelling based on sequential local operations in the course of forward raster scan followed by backward raster scan, Proceedings of the 15th International Conference on Pattern Recognition, vol. 2, 434-437 (2000)
[24] Suzuki, K., Horiba, I. & Sugie, N., Linear-time connected component labelling based on sequential local operations, Computer Vision and Image Understanding, vol. 89(1), 1-23 (2003)
[25] Lutz, R. K., An Algorithm for the Real Time Analysis of Digitised Images, The Computer Journal, vol. 23(3), 262-269 (1980)
[26] He, L., Chao, Y., Suzuki, K. & Wu, K., Fast connected-component labelling, Pattern Recognition, vol. 42, 1977-1987 (2009)
[27] Hawick, K. A., Leist, A. & Playne, D. P., Parallel Graph Component Labelling with GPUs and CUDA, Parallel Computing, pre-print submission, April (2010)
[28] CUDA Programming Guide, version 3.1, 2010. Retrieved 30 October, 2010 from NVIDIA Corporation: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf
[29] SKA Simulated Skies (S3) Tools. Retrieved 16 July, 2010 from the University of Oxford: http://www.lra.ens.fr/~levrier/Recherche/S3/
[30] Whiting, M., Duchamp Source Finder, version 1.1.9, 2010. Retrieved 12 May, 2010 from the Australia Telescope National Facility: http://www.atnf.csiro.au/people/Matthew.Whiting/Duchamp/
[31] Badenhorst, S., GPU Accelerated Noise Removal in Radio Astronomy: A CUDA Implementation, Honours Thesis (2010)


Appendix A
Input Parameters
The sets of parameters used during testing for Duchamp and our implementation are included in this appendix.

Duchamp Parameters

Image to be analysed.........................[imageFile] = /home/gary/ParalleX/test--input/galaxy1.cm.fits
Intermediate Logfile...........................[logFile] = logfile.txt
Final Results file.............................[outFile] = results.txt
Spectrum file..............................[spectraFile] = spectra.ps
0th Moment Map...............................[momentMap] = duchamp-MomentMap.ps
Detection Map.............................[detectionMap] = duchamp-DetectionMap.ps
Display a map in a pgplot xwindow?.........[flagXOutput] = false
Saving mask cube?.......................[flagOutputMask] = false
Saving 0th moment to FITS file?.........[flagOutputMask] = false
Type of searching performed.................[searchType] = spatial
Trimming Blank Pixels?........................[flagTrim] = false
Searching for Negative features?..........[flagNegative] = false
Removing Milky Way channels?....................[flagMW] = false
Removing baselines before search?.........[flagBaseline] = false
Smoothing data prior to searching?..........[flagSmooth] = false
Using A Trous reconstruction?...............[flagATrous] = false
Using Robust statistics?...............[flagRobustStats] = false
Using FDR analysis?............................[flagFDR] = false
SNR Threshold (in sigma)........................[snrCut] = 0.1
Minimum # Pixels in a detection.................[minPix] = 100
Minimum # Channels in a detection..........[minChannels] = 2
Growing objects after detection?............[flagGrowth] = false
Using Adjacent-pixel criterion?...........[flagAdjacent] = false
Max. spatial separation for merging......[threshSpatial] = 3
Max. velocity separation for merging....[threshVelocity] = 5
Reject objects before merging?........[flagRejectBeforeMerge] = false
Merge objects in two stages?..........[flagTwoStageMerging] = false
Method of spectral plotting.............[spectralMethod] = peak
Type of object centre used in results......[pixelCentre] = centroid


Sample Configuration

image = {
    path = "test--input/galaxy1.cm.fits";
    subsection = {
        x-axis = "*";
        y-axis = "*";
        z-axis = "*";
    };
};
logging = {
    enabled = TRUE;
    level = "DEBUG"; /* WARNING, ERROR, CRITICAL, DEBUG */
};
detection = {
    threshold = {
        sigmaclip = FALSE;
        fdr = {
            enabled = TRUE;
            alpha = 0.05;
            channel_correlation = 2;
        };
    };
    merging = {
        adjacency_required = FALSE;
        max_spatial_separation = 3;
        max_spectral_separation = 5;
    };
    rejection = {
        min_channel_span = 2;
        min_pixels = 100;
        min_channels = 2;
    };
};
alteration = {
    a_trous = {
        output_file = "a_trous_output.fits";
        enabled = FALSE;
        min_scale = 1;
        max_scale = 0;
        signal_to_noise_recon_cutoff = 4;
    };
};
