COVER FEATURE
Computing in Astronomy: Applications & Examples
Computing in astronomy has a wide variety of applications. We grouped a series of sidebars to provide pointers to ongoing work as well as to give illustrative examples on astronomical visualization, event classification, big data, and software libraries. —Victor Pankratius and Chris Mattmann, guest editors
Visualizing the Universe: Using Modern Graphics Cards to Understand the Physical World F. Alexander Bogert, Nicholas Smith, and John Holdener, University of California, Santa Cruz
An open source solution for visualizing large, complex datasets uses CUDA to let researchers interact with their data in real time. This approach also provides visualization options such as stereoscopic rendering and remote streaming.
COMPUTER
Published by the IEEE Computer Society. 0018-9162/14/$31.00 © 2014 IEEE

In the University of California, Santa Cruz's Astrophysics 3D Visualization Lab, we are developing a module for the yt-project, an open source system for analyzing and visualizing scientific data, that performs volume rendering on an Nvidia graphics card.1 We use CUDA to create ray-casting algorithms, moving the volume rendering of the numerous data structures that yt can parse onto the graphics card, which lets users interact with the data in a 3D space in real time.

yt is an openly developed and freely available analysis and visualization system for volumetric data; Figure 1 shows an example of how yt can be used to explore scientific data. Originally applied to astrophysical simulations, it has since been extended by researchers to work with radio telescope data, seismological simulations, high-energy-density physics data, and nuclear engineering simulations.

At its core, yt provides a method of describing physical rather than computational objects inside an astrophysical simulation. Specifically, it provides methods for selecting regions, applying analysis to regions, visualizing (including volume rendering, projections, slices, and phase plots), and exporting data to external analysis packages. yt lets astrophysicists think about the physical questions that drive their research, rather than the computational and technical steps required to answer them. For example, adaptive mesh refinement (AMR) data consists of cells or grid patches that can be refined to a higher resolution and overlap with coarser objects. When invoked on an AMR dataset, yt will transparently select data by masking the coarse cells, convert code units to physical units, process that data, optionally decompose it across multiple processors for parallel analysis, and then return the reduced data product or visualization to the user.

yt was designed to separate data indexing and selection (such as patch-based AMR, N-body datasets, octrees, and irregular or unstructured meshes) from the data processing that produces astrophysically meaningful outputs. This lets researchers use identical routines to generate simulated spectra and observations independent of the underlying simulation code, enabling direct cross-code comparisons and technology transfer. A fundamental component of yt is its investment in community: by providing different levels of contribution (infrastructure, code interfaces, analysis modules, documentation, and engagement), it has scaled to contributions from 50 different people, with an active development process and a user community spread across the globe.
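As background for the CUDA ray-caster mentioned above, each GPU thread marches one ray through the volume and composites samples front to back. The following is a minimal pure-Python sketch of that per-ray loop, for illustration only; the actual module runs the equivalent as a CUDA kernel over yt's data structures.

```python
import math

def composite_ray(samples, step=1.0):
    """Front-to-back compositing of (emission, opacity) samples along one ray.

    In a CUDA ray-caster, each GPU thread runs a loop like this for its
    own pixel; here it is plain Python for illustration.
    """
    color, transmittance = 0.0, 1.0
    for emission, opacity in samples:
        # Opacity of one step through the medium (Beer-Lambert attenuation).
        alpha = 1.0 - math.exp(-opacity * step)
        color += transmittance * alpha * emission
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # early ray termination saves work
            break
    return color
```

The early-termination test is one reason GPU ray-casters stay fast: once a ray is effectively opaque, no further samples along it need to be fetched.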
Figure 1. Slice plot of a magnetized white dwarf binary merger.2 A frame in the cylindrical coordinates' r-z plane shows the ratio of gas pressure to magnetic pressure, or β value. (Figure taken from http://yt-project.org/gallery.html.)

We have performed initial benchmarks that show the substantial advantage of volume rendering on the graphics card in comparison with CPUs. We generated an example video (see http://youtu.be/Xd59o4q15Y4) in real time, at 30 frames per second, using our CUDA ray-caster and a third-party video-capturing program (http://en.wikipedia.org/wiki/SimpleScreenRecorder). yt's current ray-caster is software based, making processing significantly slower: to create a video with the same frame rate and length, thousands of individual images must be generated and assembled, and running on one core of an Intel Xeon E5-4620, each image takes 4.8 seconds to render, requiring hours of CPU time to make the same video.

Our results show that providing yt with a graphics suite that takes advantage of Nvidia CUDA cores will be a powerful contribution to the community, and we hope to extend visualization support to numerous simulation codes. Once completed, our module will equip yt users with stunning visuals of star birth, galaxy formation, the cosmic web of dark matter, and many other astrophysical phenomena. We believe both the public and the scientific community will benefit greatly from these advances in visualization software.
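For a sense of scale, the CPU-time figure quoted above works out as follows (the one-minute video length is our assumption for illustration):

```python
fps = 30                       # frame rate of the real-time GPU capture
seconds_per_frame_cpu = 4.8    # one core of an Intel Xeon E5-4620
video_seconds = 60             # assumed length of the example video

frames = fps * video_seconds                        # 1,800 frames
cpu_hours = frames * seconds_per_frame_cpu / 3600   # 2.4 CPU-hours
speedup = seconds_per_frame_cpu * fps               # ~144x versus real time
```

In other words, a single minute of video costs roughly 2.4 CPU-hours on one core, versus one minute of wall-clock time on the GPU.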
Acknowledgments We explicitly thank Matt Turk, Linda Werner, Joel Primack, Enrico Ramirez-Ruiz, and HIPACC for all their support on this project.
References
1. M.J. Turk et al., "yt: A Multi-Code Analysis Toolkit for Astrophysical Simulation Data," Astrophysical J. Supplement Series, vol. 192, no. 1, 2011, p. 9.
2. S. Ji et al., "The Post-Merger Magnetized Evolution of White Dwarf Binaries: The Double-Degenerate Channel of Sub-Chandrasekhar Type Ia Supernovae and the Formation of Magnetized White Dwarfs," arXiv preprint, 2013; arXiv:1302.5700.
F. Alexander Bogert is a visualization director at the University of California, Santa Cruz (UCSC). Contact him at [email protected].
Nicholas Smith is a recent graduate of UCSC. Contact him at [email protected].
John Holdener is a recent graduate of UCSC. Contact him at [email protected].
AUGUST 2014
Visualizing Big Data in Astronomy: The Automated Movie Production Environment Distribution and Display (AMPED) Pipeline Eric M. De Jong, Jet Propulsion Laboratory
Figure 1. A single frame from a high-definition simulated flight over the Mojave Crater on Mars. The frame is based on stereo images taken by the High-Resolution Imaging Science Experiment (HiRISE) camera on NASA’s Mars Reconnaissance Orbiter.
Storing, transporting, analyzing, and visualizing rapidly growing quantities of data is a significant challenge in astronomy. The Automated Movie Production Environment Distribution and Display (AMPED) pipeline is the visualization component of the Jet Propulsion Laboratory's Research and Technology Development Astronomy Big Data Initiative; it aims to enhance dataset usage through "movie" visualization.
Storing, transporting, analyzing, and visualizing rapidly growing quantities of data is a significant challenge in astronomy. To tackle it, the Jet Propulsion Laboratory's Research and Technology Development Astronomy Big Data Initiative uses the Automated Movie Production Environment Distribution and Display (AMPED) pipeline as a visualization component. Astronomy datasets include complex sets of parameters that change with time or other independent parameters, and the analysis of these datasets is greatly enhanced through "movie"-style visualizations.

The system uses new event and feature detection codes, metadata-driven procedures, and custom search and pattern-recognition algorithms to create movies from astronomy simulation, observation, and archive datasets. In so doing, AMPED helps scientists visualize mission operations, observations, features, targets, and dynamic processes. The system's inputs include mission plans, spacecraft commands, feature catalogs, models, simulations, observations, metadata, and events. It exploits expert knowledge to create time-, feature-, and event-driven algorithms and procedures.

AMPED movies enable rapid comparison of observations and models at a variety of temporal-spatial scales and viewpoints. Scientists can watch changes that occur over decades, centuries, millennia, and geologic epochs in a few minutes on screen. Frames can be sped up, slowed down, edited, repeated, and looped to focus on particular events. Vertical, horizontal, and temporal scales can be exaggerated to enhance small-scale features. Infrared and radar observations can be mapped into the visible spectrum, with the ability to modify the contrast, intensity, color, hue, and saturation to highlight differences. Comparison of model predictions with observations is the ultimate test of a scientific model or theory.

Experimental data records (EDRs) and reduced data records (RDRs) contain instrument observations and metadata and are stored in a variety of formats. EDR and RDR metadata contains hundreds to thousands of unique parameter=value pairs. Metadata parameters include observation time and date, spatial-temporal reference data, illumination angle, instrument pointing, and field of view. AMPED includes a file transformation (FileTrans) code that automatically translates files from one EDR/RDR format to another; this code also creates multiple versions of AMPED movies for distribution.
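Metadata parameter=value pairs of this kind are typically stored in plain-text labels. A rough sketch of reading such a label into a dictionary follows; the field names and format here are illustrative, not the actual EDR/RDR schema.

```python
def parse_label(text):
    """Parse simple KEY = VALUE metadata lines into a dict."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a parameter=value pair
            meta[key.strip()] = value.strip().strip('"')
    return meta

# Hypothetical label fragment, loosely modeled on planetary data products.
label = '''
INSTRUMENT_NAME = "HIRISE"
START_TIME      = 2010-01-04T12:00:00
EMISSION_ANGLE  = 3.2
'''
metadata = parse_label(label)
```

A translation step like FileTrans would read metadata in one such format and re-emit it in another, leaving the parameter values themselves unchanged.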
Figure 1 is a single frame from an AMPED movie that simulates a flight over Mojave Crater on Mars. The virtual camera's altitude is approximately 3 kilometers above the Martian surface. A 3D model of the crater was created from a stereo pair of Mars Reconnaissance Orbiter (MRO) High-Resolution Imaging Science Experiment (HiRISE) observations; the MRO spacecraft has collected more than 200 Tbits of scientific data as it orbits 300 kilometers above Mars' surface. The 3D model's image resolution is 25 centimeters per pixel, which enables scientists to explore the details and dynamics of geologic features including polar caps, volcanoes, craters, dunes, valleys, lakebeds, and river networks.
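Mapping an infrared or radar band into the visible range with adjustable contrast, as described above, reduces in the simplest case to a linear stretch over a chosen window. The sketch below is a toy version; the pipeline's actual transfer functions are more sophisticated.

```python
def stretch(values, lo, hi):
    """Linearly map values in [lo, hi] to display levels 0-255, clipping
    anything outside the chosen contrast window."""
    out = []
    for v in values:
        t = (v - lo) / (hi - lo)    # normalize into [0, 1]
        t = min(max(t, 0.0), 1.0)   # clip out-of-window values
        out.append(round(255 * t))
    return out
```

Narrowing the [lo, hi] window exaggerates contrast within it, which is exactly the kind of adjustment used to make subtle brightness differences visible.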
AMPED connects astronomers to space- and ground-based observations, numerical models, simulations, advanced computer graphic visualization, and image-processing and analysis tools. The AMPED pipeline has been adapted to support several NASA science missions, and AMPED movies are delivered online to science team members, museums, planetariums, science centers, and mission and social media websites.

Eric M. De Jong is a planetary scientist at NASA's Jet Propulsion Laboratory, an adjunct professor of astronomy and physics at the University of Southern California, and a visiting associate for planetary science at Caltech. Contact him at [email protected].
Supporting Distributed, Collaborative Review and Classification of Fast Transient Events Andrew F. Hart, Luca Cinquini, Shakeh E. Khudikyan, David R. Thompson, Chris A. Mattmann, Kiri Wagstaff, Joseph Lazio, and Dayton Jones, Jet Propulsion Laboratory
We discuss recent efforts utilizing open source software and machine learning to develop a metadata-driven data collection and candidate review framework to support a modern, high-throughput radio astronomy experiment investigating fast transient events. Metadata from participating instruments is collected and cataloged in real time, and feedback from distributed science team members, collected via a web-based user interface, is used to refine machine-learning approaches to the identification and classification of salient radio transient events.
An increasingly pervasive challenge within modern radio astronomy experiments is the rapid identification and classification of scientifically interesting information from the vast amounts of information currently being collected. With instruments generating ever-greater volumes of data, there is often pressure to filter, classify, and reduce data as quickly and as efficiently as possible to optimize usage of available compute and storage resources while maximizing the scientific return on investment. Incorporating automation and machine-learning techniques into data-processing pipelines to transform raw sensor data into usable scientific analysis products is an increasingly essential approach to addressing this challenge.

In many cases, however, complete and accurate automation of the full process of identification, validation, and classification of interesting events remains an open research problem. A balance must be struck between utilizing automation where it is essential and facilitating efficient human review where it is unavoidable. Where this hybrid approach is required, tool support must be sensitive to the natural workflow of the human domain expert and must promote efficient review without compromising accuracy. Additionally, many larger radio astronomy experiments require a coordinated effort by teams of scientists who might be separated geographically. Software tools must take this distributed collaboration model into account and support the collection and collation of asynchronous contributions from different team members.

Our team at the NASA Jet Propulsion Laboratory has collaborated with science team members for the VLBA Fast Transient Experiment (V-FASTR), a software-based detection system installed at the National Radio Astronomy Observatory's Very Long Baseline Array (VLBA), to develop a set of software components for the rapid evaluation, classification, and archiving of fast radio transient detections.
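The framework catalogs each candidate's metadata so reviewers can slice the collection by facets such as classification or dispersion measure. As a rough sketch, a faceted Solr-style query string for browsing candidates might be built like this; the field names (classification_s, dm_f) are hypothetical, not V-FASTR's actual schema.

```python
from urllib.parse import urlencode

def candidate_query(classification=None, min_dm=None, rows=20):
    """Build query parameters for a faceted search over transient candidates.

    Field names are made up for illustration.
    """
    params = [
        ("q", "*:*"),
        ("rows", rows),
        ("facet", "true"),
        ("facet.field", "classification_s"),
    ]
    if classification is not None:
        # Filter query restricting results to one reviewer-assigned class.
        params.append(("fq", 'classification_s:"%s"' % classification))
    if min_dm is not None:
        # Range filter on dispersion measure.
        params.append(("fq", "dm_f:[%s TO *]" % min_dm))
    return urlencode(params)
```

Facet counts returned by such a query let the portal show, at a glance, how many candidates fall into each class awaiting review.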
"Fast radio transients" are bright millisecond pulses of radio-frequency energy: short-duration pulses produced by known objects such as pulsars, or potentially by more exotic objects such as evaporating black holes. The identification and verification of such an event would be of great scientific value. V-FASTR uses a "commensal" (piggyback) approach, analyzing array data collected during routine VLBA observations and identifying candidate fast transient events.

To support the timely analysis of fast transient candidates by V-FASTR scientists, we developed a metadata-driven, collaborative candidate review framework. The framework has two principal components: a processing pipeline for extracting and cataloging the metadata for candidate events, and an interactive, web-based data portal to facilitate collaborative review and classification of these event candidates.

The system implementation heavily leverages open source software. Apache OODT, a data integration framework from the Apache Software Foundation, provides the backbone for the processing pipeline's data and metadata management aspects. We employ Apache Solr, a widely used open source search engine, for rapid querying, subsetting, and facet-based search on the review portal; data for candidate transient events is processed via OODT, and the extracted metadata is indexed into Solr. Solr's expressive query support drives the dynamic web portal search capability. These two flexible open source software products enabled us to rapidly develop a robust data management system for the experiment.

Although the framework is still under development, the prototype made available to the V-FASTR science team was enthusiastically received and has been incorporated into the team's candidate review workflow. We are still gathering detailed usage statistics, but anecdotal evidence suggests that efficiency has improved. Using this combination of statistical and subjective feedback will assist us in refining subsequent versions of the software.

Andrew F. Hart is CTO of Pogoseat; at the time of this writing, he was a senior software engineer at NASA Jet Propulsion Laboratory. Contact him at [email protected].
Luca Cinquini is a senior software engineer at NASA Jet Propulsion Laboratory. Contact him at [email protected].
Shakeh E. Khudikyan is a software engineer at NASA Jet Propulsion Laboratory. Contact her at [email protected].
David R. Thompson is a research technologist at NASA Jet Propulsion Laboratory. Contact him at [email protected].
Chris A. Mattmann is chief architect in the Instrument and Science Data Systems group at NASA Jet Propulsion Laboratory and an adjunct associate professor at the University of Southern California. Contact him at [email protected].
Kiri Wagstaff is a machine learning expert at NASA Jet Propulsion Laboratory. Contact her at [email protected].
Joseph Lazio is a chief scientist at NASA Jet Propulsion Laboratory. Contact him at [email protected].
Dayton Jones is a principal scientist at NASA Jet Propulsion Laboratory. Contact him at [email protected].
Big Data Technologies at JPL Dayton L. Jones, Jet Propulsion Laboratory
Figure 1. Process control system architecture showing how the Object Oriented Data Technology (OODT) components interact. (Figure courtesy of C. Mattmann.)
The author summarizes recent work at the Jet Propulsion Laboratory in four technology areas needed for solving big data challenges: low-power signal processing, real-time analysis using machine-learning algorithms, scalable data archiving and mining, and data visualization. While the needs of future large radio astronomy arrays motivated much of this work, the same challenges appear in a wide range of research fields and industries.
The Jet Propulsion Laboratory (JPL) in Pasadena invests in big data technologies in four broad areas: low-power signal processing; real-time data triage; scalable and flexible data movement, archiving, and mining; and data visualization via automated video generation.

Power consumption can dominate the operating costs of digital signal processing systems, which are the source of big data in next-generation astronomical instruments and other large-scale science facilities. JPL is developing ASIC architectures that can dramatically reduce data transport between IC chips, and thus power consumption, during the major steps in radio array signal processing, including beamforming, autocorrelation, and cross-correlation of data from up to thousands of individual antennas.

Detecting transient astronomical signals that last much less than one second is an area of rapidly increasing research interest. With raw data generation rates far too high to store for more than a few seconds, it is essential to make fast and reliable decisions about the reality and scientific value of potential fast transient signals in real time, so that a detected event's unaveraged data can be archived for more careful analysis offline. Machine-learning algorithms are well suited to this task, and JPL has applied several data-adaptive techniques to an ongoing, commensal search for fast transient signals at the National Radio Astronomy Observatory's Very Long Baseline Array. These techniques have demonstrated an increase in real event detection probability and, more importantly, a large reduction in the rate of false detections.

In addition, the machine-learning group has developed improved transient classification algorithms for optical surveys, including the Palomar Transient Factory, an automated sky survey to detect supernovae and other variable objects. The planned next generation of such surveys, exemplified by the Large Synoptic Survey Telescope (LSST), will produce millions of transient detections every night. Another, more JPL-specific machine-learning application is to reduce the data downlink requirements of future space missions by extracting only the most important data produced by onboard instruments. An example of this is the autonomous selection of
spectral features from the ChemCam instrument on the Curiosity Mars rover.

A common feature of most big data challenges, including those in astronomy, is the need for archives that can handle massive data quantities in multiple heterogeneous formats and allow efficient remote input, access, and mining of that data. JPL's open source Object Oriented Data Technology (OODT) tools, and the associated process control system (PCS), were developed for Earth science archives but have proven valuable for several large astronomical data archives and for other applications in the medical and climate-monitoring fields. The OODT framework allows efficient collaborative decision making on data classification and mining by remote users. As Figure 1 shows, it includes a crawler and metadata extractor for data ingestion, a resource manager, and a workflow engine. These tools are the first major NASA software to be supported by the Apache Software Foundation.

Accumulating huge quantities of complex data is of limited value if new knowledge cannot be extracted from them. Data visualization technologies can help identify interesting
subsets of data, trends, and unexpected features, and then present relevant information in a human-understandable format. This approach will become ever more essential as the possibility of human examination of raw data recedes.

Nearly all the work described here was initially motivated by the needs of future radio astronomy instruments, such as the Square Kilometre Array (SKA), which face great challenges in managing petabyte-per-second data rates and exabyte data storage requirements. However, this research is clearly applicable to a wide variety of big data challenges in other fields as well.
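The real-time triage idea described above, keeping only the raw data windows around credible detections, can be caricatured in a few lines. Here a fixed threshold stands in for the machine-learned classifiers; this is an illustration of the archiving pattern, not the actual detection algorithm.

```python
from collections import deque

def triage(stream, threshold, window=64):
    """Toy real-time triage: hold a rolling buffer of raw samples and
    archive the unaveraged window only when a sample exceeds the
    detection threshold; everything else is discarded."""
    recent = deque(maxlen=window)
    archived = []
    for sample in stream:
        recent.append(sample)
        if sample > threshold:
            archived.append(list(recent))  # snapshot for offline analysis
    return archived
```

The rolling buffer is the key: since the raw stream cannot be stored for more than a few seconds, only the window surrounding each candidate event survives for offline analysis.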
Acknowledgments This work was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.
Dayton L. Jones is a principal scientist at the Jet Propulsion Laboratory, California Institute of Technology. Contact him at [email protected].

The Astrophysics Source Code Library: ascl.net
Alice Allen, Astrophysics Source Code Library; Lior Shamir, Lawrence Technological University; Peter Teuben, University of Maryland

Research in astronomy depends on software that often is not easily accessible to other researchers. In the interest of transparency and reproducibility, the Astrophysics Source Code Library indexes this software and provides a way for scientists to find and cite it.

The Astrophysics Source Code Library (ASCL; http://ascl.net) is a free online registry of scientist-written software used in astronomy research. With over 900 entries, it covers a significant number of the astrophysics source codes used in peer-reviewed studies. ASCL editors examine both new and old peer-reviewed papers, actively seeking those that describe methods or experiments that involve the development or use of software; they then add entries for the found codes to the library. This approach ensures that authors are not required to actively submit anything (although author submissions are welcome), resulting in a comprehensive and growing resource. On average, editors add 15 codes each month.

Software in the ASCL covers all areas of astrophysics and computational tasks in the field, including N-body simulations, stellar and galactic evolution, astronomical image processing, data reduction, hydrodynamics, Bayesian methods, gas dynamics, and cosmology. A wide variety of languages, including Python, Fortran, IDL, MATLAB, Java, C, and C++, are represented as well. ASCL editors assign a unique identifier to each code, and this ascl ID becomes part of the entry's permanent link. Editors assess submitted software entries before assigning this ID because all codes with an ascl ID are indexed by the Astrophysics Data System (ADS), the primary resource for searching and browsing publications in astronomy. Software in the ASCL need not be in the public domain; the registry lists programs with various open source licenses as well as those that are copyrighted or in the public domain. Entries can be browsed in alphabetical or date-added order, and the ASCL also offers a full-text search function.

As projects generating massive databases that require computational methods to manage, store, process, analyze, and mine them continue to proliferate,1 research has become dependent on those computational methods.2 It is increasingly important to make the software used for research discoverable to ensure the transparency and reproducibility of our science. Just as journals provide the method for publishing papers and the Virtual Observatory provides data discovery resources, the ASCL provides an infrastructure for publishing and discovering source code in the discipline.

References
1. L. Shamir et al., "Practices in Source Code Sharing in Astrophysics," Astronomy and Computing, vol. 1, 2013, pp. 54–58.
2. P. Teuben et al., "Practices in Code Discoverability," Astronomical Data Analysis Software and Systems XXI, Astronomical Soc. of the Pacific, 2012, pp. 623–626.

Alice Allen is editor of the Astrophysics Source Code Library. Contact her at [email protected].
Lior Shamir is an associate professor of computer science at Lawrence Technological University. Contact him at [email protected].
Peter Teuben is a research scientist in the astronomy department at the University of Maryland, College Park. Contact him at [email protected].