arXiv:1703.07256v1 [stat.AP] 21 Mar 2017
Statistical Topology and the Random Interstellar Medium
Robin Henderson, Irina Makarenko, Paul Bushby, Andrew Fletcher, Anvar Shukurov ∗
School of Mathematics & Statistics, Newcastle University, UK
March 22, 2017
Abstract
Current astrophysical models of the interstellar medium assume that small scale variation and noise can be modelled as Gaussian random fields or simple transformations thereof, such as lognormal. We use topological methods to investigate this assumption for three regions of the southern sky. We consider Gaussian random fields on two-dimensional lattices and investigate the expected distribution of topological structures quantified through Betti numbers. We demonstrate that there are circumstances where differences in topology can identify differences in distributions when conventional marginal or correlation analyses may not. We propose a non-parametric method for comparing two fields based on the counts of topological features and the geometry of the associated persistence diagrams. When we apply the methods to the astrophysical data, we find strong evidence against a Gaussian random field model for each of the three regions of the interstellar medium that we consider. Further, we show that there are topological differences at a local scale between these different regions.
Keywords: astrophysics, Betti numbers, convex hull, filamentarity, nonparametric test, persistence diagram, random field.
∗ The authors gratefully acknowledge support of the Leverhulme Trust, grant RPG-2014-427.
1
Introduction
The stars of the Milky Way and other galaxies are embedded in the interstellar medium (ISM), a mixture of gas, cosmic rays and magnetic fields. The ISM is an important and active ingredient in the Galactic system despite comprising only about 10% of the total baryonic mass of the Galaxy (Ferrière 2001). New stars form from cold, dense parts of the ISM, while stellar evolution driven by energy release from thermonuclear reactions returns some of the stellar mass to the ISM via stellar winds and supernova explosions. This injection of energy generates turbulent motions and shocks in the ISM, producing a highly heterogeneous random structure. Accurate knowledge of the spatial distribution of the ISM is required to understand the properties and evolution of galaxies. One way to probe the ISM is to observe neutral atomic hydrogen (H i), as about 90% of atoms in the interstellar gas are hydrogen (Kalberla & Kerp 2009). H i emits and absorbs radio waves at the frequency of 1420 MHz and large data sets are now available for detailed analysis. Figure 1 shows the H i distribution in a section of the southern sky. These data were obtained by the Galactic All-Sky Survey (GASS) using the Parkes 64 m radio telescope (McClure-Griffiths et al. 2009, Kalberla et al. 2010). The second and third releases of the data are available at http://www.astro.uni-bonn.de/hisurvey. The figure shows the antenna-temperature distribution T(l, b), which is related to the gas density, as a function of position on the sky, using the coordinates of Galactic longitude, l, and latitude, b. The distance to the gas cannot be measured directly, but the Doppler shift of the emission, dominated by the differential rotation of the Galaxy, produces a line-of-sight velocity v that can in principle be used to determine the location of the gas in three dimensions. The transformation is complicated, however, and not necessary for our purposes.
Instead, we obtained the two-dimensional data in Figure 1 by integrating T over velocities from v = 20.6 to v = 40.4 km/s.

Figure 1: Emission of neutral atomic hydrogen H i in a region of the Southern sky (McClure-Griffiths et al. 2009). The coordinates are Galactic longitude l and latitude b, centered on the Sun, where the mid-plane of the Galactic disc is at b = 0° and the Galactic centre is in the direction l = 0°. The brightness temperature shown is proportional to the total mass of H i along the line of sight. Dashed squares from left to right identify the Regions 1–3 used in Section 6. [Axes: longitude l (°), horizontal; latitude b (°), vertical.]

The GASS data and other surveys are rich enough to allow subtle comparisons between observations and the results of sophisticated magneto-hydrodynamic (MHD) simulations of the ISM. The data are represented as random fields with large-scale gradients and a complex topology widely believed to be related to turbulence and outflows from the Galactic disk. At a more local level, Gaussian random fields (GRF), or simple transformations thereof, have underpinned the modelling of small-scale variations in H i. In this paper we investigate whether these models are sufficient to describe the small-scale properties of T(l, b). We consider the three regions marked by dashed lines in Figure 1, which we refer to as Regions 1–3, moving from left to right. Each consists of a 256 × 256 array of temperature values. The selection of the regions was arbitrary: we did not consider any astronomical information about the locations when drawing their boundaries. In order to concentrate on small-scale variation we removed the trend from the plots by fitting to each region a
polynomial surface of order four in each of l and b. The residuals were then marginally transformed to N(0,1). If this transformation results in a realisation of a GRF then all information would be captured by the correlation function. We therefore consider two questions.
Q1. Are the transformed data sets consistent with stationary isotropic Gaussian random fields?
Q2. Are there differences between the three data sets, to which the correlation function is insensitive?
We will address both questions using techniques in topological data analysis, which is becoming a popular approach to the analysis of random fields and more generally (Adler et al. 2010, Adler & Taylor 2011, Bubenik 2015, Carlsson 2009, Edelsbrunner 2014, Fasy et al. 2014, Yogeshwaran & Adler 2015). Topological invariants such as Betti numbers, the Euler characteristic, persistence diagrams and persistence barcodes, rank functions and landscapes, have been used in areas such as astrophysics (Li et al. 2016), cosmology (Gay et al. 2010, Sousbie 2011, Sousbie et al. 2011, Pranav 2015), fluid dynamics (Kramár et al. 2016, Li et al. 2016) and medicine (Davis 2008, Chung et al. 2009, 2015). A difference in our case compared with most previous work, however, is that we have just a single observation for each region, so that inferential techniques based on sampling and asymptotics are not appropriate. In Section 2 we describe several topological summaries that are appropriate for data on two-dimensional lattices. In Section 3 we study characteristics when a GRF is appropriate. In Section 4 we demonstrate that non-Gaussian random fields can sometimes be distinguished by topological features even though first and second order properties (marginal
distribution and correlation function) are the same. In Section 5 we propose a simple procedure for comparing two single realisations of random fields and in Section 6 we describe our analysis of the GASS data.
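The marginal transformation to N(0,1) mentioned in the Introduction can be sketched as a rank-based normal-score transform. This is a minimal illustration only: the text does not say how the transformation was carried out, so the rank-based approach, the plotting position (rank − 0.5)/n, the handling of ties, and the function name `normal_scores` are all assumptions. The polynomial detrending step is not shown.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map a flat list of residuals to standard-normal quantiles by rank.

    Assumption: the smallest value gets rank 1, the largest rank n, and
    rank k is mapped to the N(0,1) quantile at (k - 0.5) / n.
    Ties are broken by position in the input (stable sort).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std.inv_cdf((rank - 0.5) / n)
    return scores


# A 2-D array of residuals would be flattened first, transformed,
# and then reshaped back onto the lattice.
residuals = [3.0, 1.0, 2.0]
scores = normal_scores(residuals)
```

The transform preserves the ordering of the residuals, so the lower-level sets of the transformed field are the same sets as before, indexed by a different threshold.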
2
Topological descriptors of a random field
Here we describe a number of topological measures that are suitable for analysing data that are distributed on a two-dimensional rectangular lattice. For more general definitions and interpretations and further information see, for instance, Adler et al. (2004), Bubenik (2015) or Fasy et al. (2014).
2.1
Level sets, persistent homology and persistence diagrams
Let z(x) be the value of a random field at location x on a two-dimensional lattice. For any real t, the lower-level set is defined as the locations that have field values below t, Ft = {x : z(x) ≤ t}. Increasing t from below defines a filtration which is used in persistent homology to describe the evolution of topological structures in the field. In our context, there are two topological features of interest, namely components and holes, whose counts in a level set determine the Betti numbers of order zero and one, β0 and β1, respectively (Carlsson 2009). Figure 2 shows four lower-level sets for a simulated field on a 10 × 10 lattice and will be used to illustrate some basic concepts.

Figure 2: Four level sets (black) for a field on a 10 × 10 lattice. Panels: (a) t = −2, (b) t = −1, (c) t = 0, (d) t = 1.

A component is a group of one or more pixels in a lower level set that are connected to each other, where for now we define neighbouring pixels to be connected if they have a common edge, and non-neighbouring pixels to be connected if there is a path of connected neighbours between them. The Betti number of order zero, β0, for Ft is the number of components in the level set. In Figure 2 for instance, in panel (a) at t = −2 we have β0 = 3, as we have assumed that pixels that share only a vertex are not connected. By t = −1 and panel (b) we have β0 = 7, and then β0 = 5 and β0 = 1 in panels (c) and (d) respectively. The emergence of a new component is described as a birth and the merger of two components is interpreted as the continuation of the component with the earlier birth time and the death of the other. In a two-dimensional field each local minimum is
associated with the birth of a component. A hole is a group of one or more pixels that are not in the level set, are connected to each other but are isolated from other pixels that are also outside the level set. The Betti number of order one, β1, for Ft is the number of holes in the level set. Thus in panel (d) of Figure 2 we have β1 = 9 as we ignore common vertices. In panel (c) we have β1 = 5, in (b) β1 = 2 and in (a) we have β1 = 1. New holes are created when existing ones split, and again we can define their birth and death levels t. The death of a hole is associated with a local maximum. Holes could alternatively be defined by symmetry as components in a similarly defined upper level set, with the filtration now running from high to low t. We will always refer to lower level sets so as to avoid confusion.
A persistence diagram is a scatterplot of birth levels against death levels for features of interest, in our case either components or holes. The left plot of Figure 3 shows a persistence diagram for components in Region 1 of the transformed GASS data. The first component to be born is by construction the last to die, producing the single point in the top left. Otherwise, the points are clustered in a loose oval. Points near the diagonal represent less persistent structures mostly associated with noise, whereas more significant features are usually associated with points away from the diagonal.
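The working definitions in this subsection can be illustrated with a small pure-Python sketch. It counts components and holes of a lower-level set using edge-sharing connectivity, and builds the component persistence pairs by merging components in order of increasing field value, with the earlier-born component surviving each merger. Several choices here are assumptions rather than statements of the authors' implementation: the function names, the reading that every connected group of pixels outside the level set counts as a hole (including one touching the lattice edge), the omission of zero-persistence pairs, and the convention that the first-born component dies at the maximum field value.

```python
from itertools import product


def _neighbours(i, j, n_rows, n_cols):
    """Edge-sharing neighbours of pixel (i, j) on the lattice."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < n_rows and 0 <= b < n_cols:
            yield a, b


def count_components(mask):
    """Number of edge-connected groups of True pixels in a 2-D grid."""
    n_rows, n_cols = len(mask), len(mask[0])
    seen, count = set(), 0
    for start in product(range(n_rows), range(n_cols)):
        if not mask[start[0]][start[1]] or start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:  # flood fill from this pixel
            i, j = stack.pop()
            for nb in _neighbours(i, j, n_rows, n_cols):
                if mask[nb[0]][nb[1]] and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count


def betti_numbers(field, t):
    """(beta0, beta1) of the lower-level set {x : z(x) <= t}."""
    inside = [[z <= t for z in row] for row in field]
    outside = [[not v for v in row] for row in inside]
    # Assumption: holes are counted as connected groups outside the set.
    return count_components(inside), count_components(outside)


def component_persistence(field):
    """Sorted (birth, death) pairs for components of the filtration."""
    n_rows, n_cols = len(field), len(field[0])
    pixels = sorted(product(range(n_rows), range(n_cols)),
                    key=lambda p: field[p[0]][p[1]])
    parent, birth, pairs = {}, {}, []

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for p in pixels:
        level = field[p[0]][p[1]]
        parent[p], birth[p] = p, level
        for nb in _neighbours(*p, n_rows, n_cols):
            if nb not in parent:  # neighbour not yet in the level set
                continue
            a, b = find(p), find(nb)
            if a == b:
                continue
            # The component with the earlier birth survives the merger.
            old, young = (a, b) if birth[a] <= birth[b] else (b, a)
            if birth[young] < level:  # skip zero-persistence pairs
                pairs.append((birth[young], level))
            parent[young] = old
    lowest = field[pixels[0][0]][pixels[0][1]]
    highest = field[pixels[-1][0]][pixels[-1][1]]
    pairs.append((lowest, highest))  # assumed death level of the first-born
    return sorted(pairs)
```

For a field with distinct values, each local minimum (under edge-sharing neighbours) contributes one pair, in line with the statement that each local minimum is associated with the birth of a component.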
2.2
Convex peels and summary statistics
It is sometimes difficult to interpret or compare persistence diagrams, either because of the large number of points or the bunching of many points along the diagonal. We propose peeling successive convex hulls (Barnett 1976) until only a prescribed proportion of points remain, as illustrated in the right panel of Figure 3. In this way we extract the general shape of the persistence diagram without undue influence of either outliers or the mass of points near the boundary.

Figure 3: Component persistence diagram (left) and 90% convex peel (right, as dots) for Region 1 in the GASS data. Grey lines in the right panel represent the peeled convex hulls. [Axes in both panels: birth level (horizontal) and death level (vertical), each from −4 to 4.]

We summarise the shape of the final convex hull by the following five statistics:
1. the two centroid coordinates (Cb, Cd);
2. the perimeter P;
3. the area A;
4. the filamentarity, defined as (Bharadwaj et al. 2000, Makarenko et al. 2015)
\[ F = \frac{P^2 - 4\pi A}{P^2 + 4\pi A}, \qquad 0 \le F \le 1. \]
Thus defined, F = 0 for a circle and F = 1 for a line.
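The peeling and the summary statistics above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the stopping rule (never peel below the requested proportion) is one reading of "until only a prescribed proportion of points remain", the centroid is taken to be the area centroid of the hull polygon, and points lying on a hull edge without being a vertex are not peeled.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel(points, keep=0.9):
    """Strip hull layers while at least keep * n points would remain."""
    remaining = list(set(points))
    target = keep * len(remaining)
    while True:
        hull = set(convex_hull(remaining))
        inner = [p for p in remaining if p not in hull]
        # Assumed stopping rule: do not drop below the target proportion.
        if len(inner) < target or len(inner) < 3:
            return remaining
        remaining = inner


def hull_summary(points):
    """Centroid, perimeter, area and filamentarity of the convex hull."""
    hull = convex_hull(points)
    n = len(hull)
    twice_area = cx = cy = perimeter = 0.0
    for k in range(n):
        (x0, y0), (x1, y1) = hull[k], hull[(k + 1) % n]
        w = x0 * y1 - x1 * y0  # shoelace term for this edge
        twice_area += w
        cx += (x0 + x1) * w
        cy += (y0 + y1) * w
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = twice_area / 2.0
    if n < 3 or area == 0.0:
        raise ValueError("hull is degenerate; summary is undefined")
    p2 = perimeter ** 2
    return {
        "centroid": (cx / (6.0 * area), cy / (6.0 * area)),
        "perimeter": perimeter,
        "area": area,
        "filamentarity": (p2 - 4 * math.pi * area) / (p2 + 4 * math.pi * area),
    }
```

In use, `hull_summary(peel(diagram_points, keep=0.9))` would give the five statistics for a 90% peel, where `diagram_points` is a list of (birth, death) tuples.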
2.3
Bottleneck distance
The bottleneck distance provides a measure in the space of persistence diagrams and can be used as a quantitative summary of the difference between two persistence diagrams, A and B say. A readable description, and further information, is given by Fasy et al. (2014). First, the points in A are matched to the points in B. This means that each point in A is mapped either to a unique point in B, or to its projection onto the diagonal line of birth-death equality. The same is true for B. Use of the diagonal is necessary because A and B can have different numbers of points. The cost of a mapping from a point (aix, aiy) in A to a point (bjx, bjy) in B is the norm cij = max{|aix − bjx|, |aiy − bjy|}. The total cost of a matching P between A and B is
\[ C(P) = \sum_{i,j} c_{ij}. \]
The bottleneck distance is then
\[ w(A, B) = \min_{P} C(P), \]
where the minimum is taken over all possible selections of the linked pairs. This, the bottleneck distance, is essentially the smallest cost that could be incurred in mapping all points of A into all points of B. Calculation is numerically expensive, though efficient algorithms are available. For our application, we used the TDA package in R.
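The matching described in this subsection can be illustrated with a brute-force sketch that is only practical for very small diagrams. It is not the algorithm used by the R package, and the function names, the `aggregate` argument and the example diagrams are illustrative assumptions. With the default `aggregate=sum`, a matching is scored by the total cost C(P) as written above; passing `aggregate=max` scores a matching by its largest single cost, which is the convention in the usual definition of the bottleneck distance.

```python
from itertools import permutations


def pair_cost(a, b):
    """Max-norm cost c_ij between two (birth, death) points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def diagonal_cost(p):
    """Max-norm distance from (birth, death) to its diagonal projection.

    The projection is ((b + d) / 2, (b + d) / 2), so the cost is |d - b| / 2.
    """
    return abs(p[1] - p[0]) / 2.0


def matching_cost(A, B, aggregate=sum):
    """Smallest aggregated cost over all matchings of diagrams A and B.

    Each diagram is padded with diagonal slots so that any point may be
    sent to the diagonal instead of to a point of the other diagram.
    Brute force over permutations: use only for a handful of points.
    """
    n, m = len(A), len(B)

    def cost(i, j):
        if i < n and j < m:
            return pair_cost(A[i], B[j])   # point of A to point of B
        if i < n:
            return diagonal_cost(A[i])     # point of A to the diagonal
        if j < m:
            return diagonal_cost(B[j])     # point of B to the diagonal
        return 0.0                         # diagonal slot to diagonal slot

    best = None
    for perm in permutations(range(n + m)):
        costs = [cost(i, j) for i, j in enumerate(perm)]
        total = aggregate(costs) if costs else 0.0
        if best is None or total < best:
            best = total
    return best


A = [(0.0, 4.0), (1.0, 1.2)]   # hypothetical diagram with one noisy point
B = [(0.0, 5.0)]
total_cost = matching_cost(A, B)                  # sum over linked pairs
largest_cost = matching_cost(A, B, aggregate=max) # largest single cost
```

In the example, the short-lived point (1.0, 1.2) is cheapest to send to the diagonal, which is why points near the diagonal contribute little to the distance.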
3
Topology of a Gaussian random field on a lattice
The persistent homology of a two-dimensional random field depends upon the number and distribution of local maxima and minima. In this section, we investigate the number of local extrema in a Gaussian random field on a d × d lattice. The results will be used in Section 6 to benchmark an analysis of the H i data. We assume stationarity, isotropy and N(0,1) margins throughout. The density of critical points of a Gaussian random field in continuous space is a well-studied problem (Bardeen et al. 1986). It depends upon the joint Gaussian distribution of function values and first and second derivatives. We are not aware of published results for local extrema of a field on a discrete lattice, however, although there has been work on the distribution of the global maximum over a region (Taylor et al. 2007).
3.1
Expected number of local extrema
Given a field z(x), let N0 and N1 be the numbers of points in persistence diagrams for components and holes respectively. As previously stated, each point in a persistence diagram for components corresponds to a local minimum of the field, and each point in a persistence diagram for holes corresponds to a local maximum. Due to the symmetry of the Gaussian distribution, the number of local minima has the same statistical properties as the number of local maxima. Because it is slightly tidier notationally, we will consider local maxima in this section. Let z1 = z(x1) be the (scalar) field value at some location x1 and let z2 = z(x2) be the k-dimensional vector of field values at the immediate neighbours x2 of x1, with the neighbourhood to be defined later. Then, because we have standardised, (z1, z2) is Gaussian, with zero mean and variance matrix
\[ \begin{pmatrix} 1 & r^T \\ r & R \end{pmatrix}. \]
Here the k-vector of correlations r between z1 and z2, and the k × k correlation matrix R of z2, each depend on the locations x1 and x2, though this is suppressed in the notation. We have a local maximum at x1 if z2 < z1 1k where 1k is a k-vector of ones. We can calculate this from the conditional distribution of z2 given z1 and the marginal distribution of z1. First let φ(p)(x; µ, Σ) be the p-dimensional Gaussian probability density with mean vector µ and covariance matrix Σ, with Φ(p)(x; µ, Σ) being the corresponding cumulative distribution function. So z1 has density φ(1)(z1; 0, 1), z2 has density φ(k)(z2; 0k, R), where 0k is a k-vector of zeros, and the conditional density for z2 given z1 is φ(k)(z2; rz1, R − rrT). Then
\[
p_1 = P(\text{local maximum at } x_1) = P(z_2 < z_1 1_k)
= \int_{z_1} \int_{z_2 < z_1 1_k} \phi^{(k)}(z_2; r z_1, R - r r^T)\, dz_2 \; \phi^{(1)}(z_1; 0, 1)\, dz_1
\]
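As a sanity check on the expression for p1, consider the special case of independent neighbours, r = 0k and R equal to the k × k identity. The inner integral then factorises into Φ(1)(z1; 0, 1) raised to the power k, and the outer integral evaluates to 1/(k + 1), as expected when all k + 1 values are exchangeable. The sketch below evaluates that one-dimensional integral numerically. It covers only this uncorrelated special case, and the function name, integration range and step count are illustrative assumptions.

```python
from statistics import NormalDist


def p_local_max_independent(k, half_width=8.0, steps=4000):
    """Trapezoid-rule value of the integral of Phi(z)^k * phi(z) dz.

    Special case r = 0, R = I of the local-maximum probability p1:
    the k neighbours are independent N(0,1) and independent of z1.
    """
    std = NormalDist()  # standard normal: mean 0, standard deviation 1
    h = 2.0 * half_width / steps
    total = 0.0
    for s in range(steps + 1):
        z1 = -half_width + s * h
        weight = 0.5 if s in (0, steps) else 1.0  # trapezoid end weights
        total += weight * std.cdf(z1) ** k * std.pdf(z1)
    return total * h


p_four = p_local_max_independent(4)    # four edge-sharing neighbours
p_eight = p_local_max_independent(8)   # eight neighbours including diagonals
```

With four neighbours the value should agree with 1/5, and with eight neighbours with 1/9. For correlated fields the inner integral no longer factorises and a k-dimensional Gaussian probability has to be evaluated instead.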