Machine Learning Techniques for Exoplanet Detection

Master’s Thesis of

Ralf Farkas at the Institut de Planétologie et d’Astrophysique de Grenoble (IPAG), Université Grenoble Alpes (UGA) Fakultät für Physik, Karlsruher Institut für Technologie (KIT)

Supervisors:

Prof. Dr. David Mouillet — UGA Grenoble Dr. Carlos A. Gomez Gonzalez — UGA Grenoble Prof. Dr. Florian Bernlochner — KIT Karlsruhe

Duration: 28th February 2018 – 7th June 2018

Contents

Introduction

1 Background
  1.1 Direct Detection of Exoplanets
  1.2 Angular Differential Imaging
  1.3 Machine Learning (ML) / Neural Networks (NN)
  1.4 SODINN
  1.5 Overview of Software Tools

2 Angular Differential Imaging Methods
  2.1 ANDROMEDA
  2.2 SODINN-pairwise

3 Metrics
  3.1 ADI algorithm evaluation
  3.2 ROC curves

4 Data Augmentation
  4.1 Variational Autoencoders (VAEs)
  4.2 Generative Adversarial Networks (GANs)
  4.3 Noise Generation Results

5 Conclusions and Outlook

Appendix
  A References
  B Acronyms
  C Open Source Contributions

List of Figures

1.1 Input data to the algorithms.
1.2 ADI sequence.
1.3 Functioning of the annulus-wise median subtraction.
1.4 Functioning of LOCI model PSF creation.
2.1 ANDROMEDA signal patterns.
2.2 Results of the ANDROMEDA algorithm.
2.3 Results of LOCI.
2.4 Signal patches for SODINN-pairwise.
2.5 SODINN-pairwise likelihood map and binary maps at different threshold levels.
3.1 ROC curves of SODINN-pairwise models with different amounts of training samples.
3.2 ROC curves of ANDROMEDA and full-frame PCA.
4.1 Operational sequence of the VAE.
4.2 Operational sequence of the GAN.
4.3 Noise patches for SODINN-pairwise, obtained from the data cube.
4.4 VAE-generated noise patches.
4.5 GAN-generated noise patches.
4.6 Training loss of the GAN.
C.1 Contribution statistics to the VIP GitHub repository.

Introduction

Exoplanet detection can be achieved through indirect methods like radial velocities or planet transits, or through direct observations. Direct detection methods are attractive because they can reveal additional information about the planets, such as the composition of the atmosphere or the potential presence of a dust disk. Because of the high contrast between a star and its companion, detecting a planet requires a strong interplay between optical hardware, observation techniques and post-processing by specialized algorithms. The latter is the subject of this thesis.

Chapter 1 introduces the terminology used in this work. I explain the use of Angular Differential Imaging (ADI) and describe the functioning of ADI algorithms, using the examples of Median Subtraction, LOCI and PCA. Machine Learning is an area of computer science that is increasingly being used to solve physical problems. I highlight the importance of the learning phase of Neural Networks and show how the SODINN algorithm forms a first link between ADI and Machine Learning.

Chapter 2 is dedicated to two ADI algorithms: ANDROMEDA and SODINN-pairwise. I ported the ANDROMEDA codebase from IDL to python and contributed it to the open source VIP package for high contrast imaging. ANDROMEDA takes a unique approach to ADI by applying pattern matching to pairwise subtracted frames, a task which can also be expressed as a Machine Learning problem: I present the new ADI algorithm SODINN-pairwise, which builds upon SODINN's network architecture and ANDROMEDA's pattern matching idea.

Chapter 3 shows the use of ROC curves to quantify the performance of ADI algorithms. I build ROC curves to compare the new ANDROMEDA implementation to the established Median Subtraction and PCA algorithms, and use them on the new SODINN-pairwise to stress the importance of training the model with a large number of samples.

Chapter 4 presents two types of Neural Networks, known and used in computer science, which I implemented and adapted to the task of generating new noise samples for training SODINN-pairwise.

Chapter 1

Background

1.1 Direct Detection of Exoplanets

The majority of the ∼4000 planets known today were detected using indirect detection methods, like radial velocity, transits and microlensing. Only ∼50 planets were discovered by direct imaging, mainly due to the challenging nature of high-contrast imaging. For a direct detection, there are two main factors:

– angular distance. Seen from Earth, the angular distance between a star and its companion is ∼0.01′′–1′′. A high angular resolution is required to resolve these separations.

– contrast. The planets are very faint compared to their star: the contrast ratio between a star and its planets typically ranges from 10⁻⁴ to 10⁻¹⁰. A coronagraph can be used to increase the contrast by blocking out the light of the star.

Additionally, for ground-based instruments:

– atmosphere. The turbulence in Earth's atmosphere introduces distortions to the wavefront, which degrade the image quality and resolution. These errors can be corrected by an Adaptive Optics (AO) system, which detects the distortions and compensates for them using a deformable mirror. For large telescopes which would otherwise be limited by the turbulence (resolution ∼0.5′′), an AO system allows for angular resolutions down to the diffraction limit (typically 0.02′′ in the visible and 0.04′′ in the near-infrared (NIR)).

detections. The first detections using direct imaging were β Pictoris b (Lagrange et al. 2009) with the NaCo instrument on the VLT, and three companions around HR 8799 (Marois et al. 2008) using the Keck and Gemini telescopes. Both discoveries were made in the NIR in conjunction with an AO system, which made it possible to detect contrasts of up to ∼10⁴ (2.3 × 10⁴ in the case of HR 8799 b).


With second-generation instruments like GPI and SPHERE, dedicated to exoplanet imaging with advanced AO and coronagraphs, it was possible to raise the maximum contrast close to 10⁶, as shown by the detections of 51 Eri b with GPI (Macintosh et al. 2015) and HIP 65426 b with SPHERE (Chauvin et al. 2017). As the luminosity of a planet depends on its age and mass, with future advancements in telescope technology and post-processing we anticipate discovering older (hence cooler) and lighter exoplanets, as well as achieving better characterization capabilities.

PSF. The point spread function (PSF) describes how the image of a point source appears through an optical system. Most of the objects observed through telescopes, e.g. stars and their companions, are considered point sources, which then produce a PSF on the sensor. Figure 1.1b shows an example PSF, obtained from a SPHERE IRDIS observation.

Speckles. Optical aberrations which the AO was not able to correct, and imperfections inside the telescope optics and the measuring instruments, cause residuals called speckles. They appear as bright blobs, similar to PSFs, and reduce the overall contrast (Figure 1.1a). We suppose that the speckle noise has two components: the (quasi-)static speckles with timescales of several seconds to hours, and the dynamic speckles with much shorter timescales of a few milliseconds. While the dynamic speckles average out over time, the removal of the quasi-static speckle pattern requires specific methods, like Angular Differential Imaging (ADI).


(a) One frame from the data cube.

(b) The PSF associated with the data cube.

Figure 1.1 – Input data to the algorithms. Figure 1.1a shows the first of 488 frames of the β Pictoris b data cube which was used for this thesis. The data was collected using the SPHERE IRDIS instrument in the K1 band (near-infrared, 2110 nm). The coronagraph attenuates the signal in the central regions by a factor of ∼100. The four outer bright blobs are satellite spots, created by the deformable mirror of the AO, which can be used for centring the frames. Figure 1.1b is the PSF associated with the data cube, obtained by averaging non-coronagraphic observations of the same star.

(a) first frame

(b) central frame

(c) last frame

Figure 1.2 – ADI sequence. Three frames of the 488 frames in the cube are shown. The four outer satellite spots and the speckles stay stationary, while the (injected) companion, located at the top left of the centre, moves clockwise.


1.2 Angular Differential Imaging

Angular Differential Imaging (ADI) describes the combination of a special observation technique and a data analysis method.

observation technique. A series of images is taken with the telescope's field derotator switched off. The field derotator is normally used to rotate the sensor of the telescope during an observation series, so that the observed object (the star) always keeps the same orientation. With the derotator off, the field of view rotates during the sequence while the star's (instrumental) PSF stays equally aligned. The planet appears to turn around the star, as can be seen in Figure 1.2.

data set. The data set obtained through ADI, with the time/frame number along the z axis, is called a data cube. A data cube from the SPHERE IRDIS instrument contains several hundred frames in both spectral channels, with the frame size given by the detector, e.g. 1024×1024 px. The individual frames have already been corrected for optical and detector defects through the instrument pipeline. We then usually centre the frames and crop them (e.g. to 250×250 px, see Figure 1.1a). In addition to the data cube, we obtain a file with the parallactic angles (PAs) by which the field of view has rotated (a vector of length equal to the number of frames in the cube, which can be calculated precisely by geometric considerations) and a PSF (which is typically an image of the same star taken without the coronagraph, see Figure 1.1b). The PSF can also be used for fake planet injections, where a copy of the PSF is placed inside the data cube, with respect to the PA associated with each frame, to simulate the presence of a companion. This is useful, e.g., for benchmarking ADI algorithms using ROC curves (chapter 3).
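A fake planet injection can be sketched as follows. This is a minimal numpy illustration, not the pipeline actually used in this work (which relies on VIP's injection tools); the function name, array shapes and the pixel-rounded placement are my own simplifications.

```python
import numpy as np

def inject_fake_planet(cube, pa, psf, radius, theta0, flux):
    """Add a flux-scaled PSF copy to every frame of an ADI cube.

    cube   : (n_frames, ny, nx) array of frames
    pa     : parallactic angles in degrees, one per frame
    psf    : small (p, p) PSF template
    radius : radial distance of the companion from the centre (px)
    theta0 : azimuth of the companion in the first frame (degrees)
    flux   : scaling factor applied to the PSF template
    """
    out = cube.copy()
    n, ny, nx = cube.shape
    p = psf.shape[0]
    cy, cx = ny // 2, nx // 2
    for i in range(n):
        # The companion rotates with the field: its azimuth follows the PA.
        theta = np.deg2rad(theta0 - pa[i])
        y = int(round(cy + radius * np.sin(theta)))
        x = int(round(cx + radius * np.cos(theta)))
        out[i, y - p // 2:y - p // 2 + p, x - p // 2:x - p // 2 + p] += flux * psf
    return out
```

A real injection would place the PSF with subpixel accuracy; rounding to whole pixels here only keeps the example short.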

frame differencing. The principal idea behind differential imaging techniques is that a model PSF can be constructed and then subtracted from the frames in order to remove the speckles while keeping a possible companion intact. The procedure of choosing how the subtraction terms should be constructed is called frame differencing. After each frame has been processed by subtracting the model PSF, the frames are de-rotated to align the planet signal, and the final image is obtained by collapsing the frames using a median.
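The simplest instance of this pipeline, the classical median subtraction described below, can be sketched in a few lines of numpy/scipy (the function name and the sign convention of the derotation are assumptions of this sketch):

```python
import numpy as np
from scipy.ndimage import rotate

def adi_median_subtraction(cube, pa):
    """Classical ADI: subtract the median frame as model PSF, derotate
    each residual by its parallactic angle, and median-collapse."""
    model_psf = np.median(cube, axis=0)           # static speckle estimate
    residuals = cube - model_psf                  # remove the speckle pattern
    derotated = np.array([
        rotate(frame, -angle, reshape=False)      # align the planet signal
        for frame, angle in zip(residuals, pa)
    ])
    return np.median(derotated, axis=0)           # final collapsed image
```

The speckles, being static, cancel in the subtraction, while the companion (which moves from frame to frame) survives and is stacked constructively after derotation.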


rotation threshold. Usually, frame pairs are chosen based on the rotation between them. When subtracting one frame from another, it is preferable that a possibly present companion has been displaced enough in the angular direction, so that the planet is not subtracted from itself (self-subtraction). As the minimum rotation threshold depends on the radial distance of the planet, each frame is usually divided into annuli before calculating a frame difference.

median subtraction. The paper of Marois et al. (2006), which first introduced ADI, proposes two methods of frame differencing. In the first method, a median of all frames is calculated and subtracted from each individual frame. The median averages out the statistical fluctuations in each frame, while the subtraction removes the static speckle pattern. A second, slightly more sophisticated method constructs a model PSF for each frame i in the data cube using the following procedure (see Figure 1.3):

– The frame i is divided into annuli. For every annulus, the two next and two previous frames are selected which satisfy a rotation threshold (e.g. 1.5 FWHM).
– The 4 selected frames are combined with a median to obtain the model PSF, which is then subtracted from the frame i.

By limiting the number of frames used for the model PSF construction, the temporal variation in the data cube is better taken into account.

LOCI. The LOCI algorithm (Locally Optimized Combination of Images, Lafrenière et al. 2007) creates the model PSF by segmenting each annulus, as shown in Figure 1.4:

– The frame i is divided into segments, as are the N reference images (which are the other frames in the cube, filtered by a rotation threshold).
– For each segment, the optimal linear combination of the N reference images is formed, so that the noise in each segment is minimized.

PCA. Principal Component Analysis (PCA)-based algorithms were proposed by Soummer et al. (2012, KLIP) and Amara and Quanz (2012, PynPoint).
The star's PSF is decomposed into a set of basis functions (like Fourier series or Hermite polynomials, or a basis determined empirically from the data). Each frame in the data cube is then expressed as a linear combination of those basis functions, truncated at a fixed number of principal components to construct the model PSF.
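A minimal full-frame sketch of this idea, assuming the empirical (KLIP-style) basis obtained from the data itself; the function name is my own and real implementations work annulus-wise with rotation thresholds:

```python
import numpy as np

def pca_model_psf(cube, n_components):
    """Build a low-rank model PSF for each frame by projecting the
    (flattened) frames onto the first principal components of the cube."""
    n, ny, nx = cube.shape
    matrix = cube.reshape(n, ny * nx)
    mean = matrix.mean(axis=0)
    centered = matrix - mean
    # Principal components = right singular vectors of the data matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    projection = centered @ basis.T @ basis       # truncated reconstruction
    return (projection + mean).reshape(n, ny, nx)
```

Subtracting `pca_model_psf(cube, k)` from `cube` removes the correlated speckle structure captured by the first k components, while a companion, which moves across the frames, is poorly represented by the truncated basis and survives.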


Figure 1.3 – Functioning of the annulus-wise median subtraction. For each annulus in each frame in the data cube, a model PSF is created by taking into account four surrounding frames, which are combined by a median.



Figure 1.4 – Functioning of LOCI model PSF creation. LOCI uses an annulus-segment-wise linear combination to obtain a model PSF.


1.3 Machine Learning (ML) / Neural Networks (NN)

Machine Learning (ML) is a sub-area of computer science in which the machine itself decides, based on some criteria, what is important for understanding and solving a well-specified problem. Neural Networks (NNs) are a family of algorithms used in ML. A NN is a usually large set of nodes, grouped into layers, which are interconnected. Different types of problems require different network types, and each problem requires a specialized network. Before a NN can be used, it needs to be trained with data. In the case of supervised learning, that data is known and labelled, so the network knows which result to expect. The learning phase of a NN consists of:

– forward pass. The network receives some data which it passes through its layers, which are described by a set of non-linear mathematical operations.
– loss. The network compares the result it obtained from the forward pass with the expected output (labels). It calculates the difference between the actual and the expected result (which could be a simple mean square error), which is called the loss.
– backpropagation. The network goes through its layers and nodes in the inverse direction and updates the weights (the importance of each interconnection), so that on the next forward pass the actual result matches the expected result more closely.
– The above procedure is repeated many times, until a specific criterion is met, e.g. the loss does not change any more.

If a network is well trained, it has acquired some power of abstraction, so it can be applied to new (unseen) data. The key here is that the network itself decides which features of the training data it finds important, and chooses an internal representation of those features. In the 12 years since ADI was proposed, increasingly complex algorithms have been developed, all trying to remove the speckles and noise so that a human can look at the final, cleaned-up image and decide whether they see a planet or not (or threshold the SNR map at an arbitrary value).
Using ML, the approach is different: as the network has learned what a planet looks like in a specific data set, or for a specific instrument, it can find a companion by assigning a probability to a certain position.
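The forward pass / loss / backpropagation cycle described above can be illustrated with the smallest possible "network": a single sigmoid node trained by gradient descent on toy data. This is a pedagogical numpy sketch, not the thesis code (which uses Keras); the data and hyperparameters are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled data: 2D points, label 1 if x0 + x1 > 0 (supervised learning).
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(200):
    # forward pass: weighted sum followed by a sigmoid non-linearity
    p = 1 / (1 + np.exp(-(X @ w + b)))
    # loss: binary cross-entropy between prediction and labels
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # backpropagation: gradient of the loss w.r.t. weights and bias
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # weight update
    b -= lr * grad_b

accuracy = np.mean((p > 0.5) == y)   # approaches 1 on this separable toy task
```

A deep network repeats the same cycle, only with many layers chained together and the gradients propagated backwards through all of them via the chain rule.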


1.4 SODINN

The SODINN algorithm proposed by Gomez Gonzalez et al. (2018) uses a Convolutional Long Short-Term Memory (ConvLSTM) network to distinguish planets from noise. The algorithm first needs to be trained using known (labelled) data: These are small patches (of e.g. 9×9 px) with either a planet signature (labelled c+ ) or a noise pattern (labelled c− ).

MLAR patches. A patch is a three-dimensional stack of slices, where each slice is a residual patch after the subtraction of a singular value decomposition (SVD)-based reconstruction at a different truncation level, hence the name Multilevel Low-rank Approximation Residual (MLAR). For the c+ class, fake planets were injected with varying flux. For the c− class, a flipped-sign derotation is used to increase the diversity while keeping the same statistics.
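The multilevel residuals can be sketched as follows; this is a simplified full-frame illustration (function name and the particular truncation levels are my own choices, and the actual MLAR construction works on small patches):

```python
import numpy as np

def mlar_residuals(cube, levels=(1, 2, 5, 10)):
    """Residual cubes after subtracting SVD reconstructions truncated at
    several rank levels; the stack over levels forms an MLAR-style sample."""
    n, ny, nx = cube.shape
    m = cube.reshape(n, ny * nx)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    residuals = []
    for k in levels:
        low_rank = (u[:, :k] * s[:k]) @ vt[:k]    # rank-k approximation
        residuals.append((m - low_rank).reshape(n, ny, nx))
    return np.stack(residuals)                     # (n_levels, n, ny, nx)
```

At a low truncation level, a planet signal largely survives in the residual; at higher levels, it is progressively absorbed into the reconstruction, and this characteristic evolution across the stack is what the network learns to recognize.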

1.5 Overview of Software Tools

Most of the development for this work was done in python, using the following libraries and tools:

– The SciPy stack with its core packages scipy (Jones et al. 2001), numpy (T. E. Oliphant 2015) and matplotlib (Hunter 2007) for numerical algorithms, computing, and plotting.
– Keras (Chollet et al. 2015) with the TensorFlow (Abadi et al. 2015) backend for machine learning. These packages provide tools for constructing and training neural networks.
– The SPHERE data center (Delorme et al. 2017), which provided the data sets processed in this work.


Chapter 2

Angular Differential Imaging Methods

VIP. The VIP python package by Gomez Gonzalez et al. (2017) is a library for high-contrast imaging. Besides tools for opening and displaying ADI sequences and pre-processing them (like frame alignment and filtering), VIP provides an intuitive interface to various implementations of classical ADI algorithms like Median Subtraction, LOCI and PCA, but also to recent works like LLSG (Local Low-rank plus Sparse plus Gaussian-noise decomposition, Gomez Gonzalez et al. 2016). As an open source framework, it aims to be accessible, universal and easily extensible.

2.1 ANDROMEDA

I was kindly provided with the Interactive Data Language (IDL) source code of the ANDROMEDA algorithm (ANgular DiffeRential OptiMal Exoplanet Detection Algorithm, Mugnier et al. 2009, Cantalloube et al. 2015) by Faustine Cantalloube for my work on this thesis. This allowed me to have a closer look at the functioning, but also at the implementation details, of the pattern generation and matching. The ANDROMEDA algorithm creates signal patterns and matches these patterns with every pixel in the data cube. A signal pattern consists of a shifted PSF minus a differently shifted copy of the PSF:

patt(∆x₁, ∆y₁, ∆x₂, ∆y₂) = shift(PSF₁, ∆x₁, ∆y₁) − shift(PSF₂, ∆x₂, ∆y₂)

By varying both shifts, various distances and relative angles between the two PSFs can be covered. Figure 2.1 shows a few of the PSF patterns.
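The phase-ramp (subpixel) shift and the pattern construction can be sketched with scipy's Fourier-domain shift; the helper names here are my own, not those of the ANDROMEDA code:

```python
import numpy as np
from scipy.ndimage import fourier_shift

def shift_psf(psf, dx, dy):
    """Subpixel shift of a PSF via a phase ramp in the Fourier domain."""
    return np.fft.ifft2(fourier_shift(np.fft.fft2(psf), (dy, dx))).real

def signal_pattern(psf, dx1, dy1, dx2, dy2):
    """patt = shift(PSF, dx1, dy1) - shift(PSF, dx2, dy2)"""
    return shift_psf(psf, dx1, dy1) - shift_psf(psf, dx2, dy2)
```

The resulting pattern has a positive and a negative lobe, exactly the signature a companion leaves in a pairwise frame difference.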


The algorithm works as follows:

– Before the algorithm actually processes the data cube, a PSF library is created. That PSF library is a stack of subpixel-shifted versions of the PSF, obtained by phase ramping. The PSF library is pre-calculated to save computation time inside the pixel loops of the algorithm core.
– Frame pairs are formed by taking into account a minimum rotation threshold.
– The data cube is divided into annuli, then every pixel r0 in an annulus is chosen as a candidate position for a companion:
  – If a companion were present at the position r0, it is known where it would be in each of the frame pairs. For every frame pair, a PSF pattern is created by placing PSF1 and PSF2 at the corresponding locations. Even if a companion is exactly at the pixel r0 in the first data cube frame, it can be between two pixels in the following frames. This is taken into account by actually placing subpixel-shifted versions of the PSF, taken from the PSF library.
  – The frame pairs are matched with their corresponding PSF patterns to construct a maximum likelihood estimator.
– The flux and signal-to-noise ratio (SNR) maps, shown in Figure 2.2, directly result from the per-pixel maximum likelihood estimators.

vip.andromeda. I ported the IDL source code to python and contributed it as open source to the VIP framework. It is tightly integrated with the available and newly developed tools in VIP, providing a simple interface for ADI and sophisticated benchmarks and comparison between ANDROMEDA and other algorithms (see chapter 3). Also, by taking advantage of python’s native multi-processing capabilities, it was possible to significantly decrease the computation time, e.g. from ∼10 min down to ∼1 min 30 s (for the same ADI dataset processed on the institute’s ipag-calc1 machine, and 20 concurrent processes for the python version). As a side product of the porting process, the python library idlwrap was developed and published, which can help future projects to transition from IDL to python. A more detailed description can be found in Appendix C.


Figure 2.1 – ANDROMEDA signal patterns. Each pattern consists of the difference of two shifted PSFs. Shifting is done by a phase ramp, which also allows for sub-pixel shifting. The patterns are matched with each pairwise frame difference.

Figure 2.2 – Results of the ANDROMEDA algorithm. Input to the algorithm was the data cube from Figure 1.2. Figure 2.2a shows the final flux map. While the injected planet on the top-left was revealed, the inner regions still show residual starlight. The SNR map in Figure 2.2b was normalized radially by its robust deviation, so that a constant, distanceindependent threshold can be applied. The binary map in Figure 2.2c is thresholded at 50 σ.

Figure 2.3 – Results of LOCI. Figure 2.3a shows the final reduced image obtained from the ADI data cube shown in Figure 1.2 with the LOCI algorithm. The injected companion was recovered with a high flux value. Figure 2.3b shows the SNR map which was constructed based on the reduced frame, and Figure 2.3c the result of thresholding it at 5 σ.

2.2 SODINN-pairwise

While Median Subtraction and LOCI remove residual starlight to bring out the signal of the planet, ANDROMEDA goes one step further by actively searching for a planet signature. That pattern matching approach can also be exploited using NNs, in a similar fashion to the original SODINN. We propose a new ADI algorithm named SODINN-pairwise, which builds on the ideas of ANDROMEDA and SODINN. The procedure of SODINN-pairwise consists of three phases:

1. samples. Positive (c+) and negative (c−) patches are extracted from the data cube.
2. training. The two classes of patches are used to train a model which learns to discriminate between them.
3. prediction. The trained model is then fed with patches from the data cube, and for each pixel a probability of a match (class c+ membership) is recorded, resulting in a probability map.

positive patches. The frames of the data cube are divided into annuli with a width of 1 FWHM. In each annulus, 3 fake companions are injected at equidistant angles by placing a flux-scaled copy of the PSF on the frames of the cube. For each frame, the closest following frame is chosen which satisfies a minimum rotation threshold (e.g. 0.5 FWHM), and the two frames are subtracted from each other. The rotation threshold has to be smaller than 1 FWHM, as we want each (fake) planet signal to be subtracted from itself, creating a pattern similar to ANDROMEDA's. The 3 signal patterns are then cut out from the cube to obtain patches of 20×20 px, some of which are shown in Figure 2.4. The signal patches constitute the class c+. To avoid creating secondary patterns from possibly present (but unknown) astrophysical signal in the frame differences, all PAs were multiplied by −1 (sign-flipped). Negative rotation ensures that unwanted signal in the background is no longer aligned and behaves like a random fluctuation when subtracting.

negative patches. The negative patches, forming the class c−, contain only noise. They are grabbed analogously to the signal patches (through pairwise frame differences with a rotation threshold and cropping). As the structure of the noise is not known, it is not straightforward to create synthetic samples like is done for the positive patches. Figure 4.3 shows the diversity of the extracted noise patches. Their structure varies strongly between big blobs (speckle-dominated regions at the central parts of the frame) and Gaussian-like noise (outer regions).
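The selection of the closest following frame satisfying the rotation threshold can be sketched as follows (a minimal illustration with a threshold already expressed in degrees; in practice the FWHM-based threshold is converted to an angle that depends on the annulus radius, and the helper name is my own):

```python
import numpy as np

def pairwise_indices(pa, min_rotation):
    """For each frame, find the closest following frame whose
    parallactic-angle difference exceeds `min_rotation` (degrees)."""
    pairs = []
    for i, angle in enumerate(pa):
        for j in range(i + 1, len(pa)):
            if abs(pa[j] - angle) >= min_rotation:
                pairs.append((i, j))
                break                     # keep only the closest qualifying frame
    return pairs
```

Each returned pair (i, j) then yields one difference frame, from which the 20×20 px patches are cropped.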

training. The obtained positive and negative patches are then passed to the SODINN-pairwise model, which is a modified version of the original SODINN model presented by Gomez Gonzalez et al. (2018). SODINN is built around a ConvLSTM, which is adapted for three-dimensional samples and can exploit both the spatial and sequential correlations present in the MLAR patches. SODINN-pairwise is in essence a Convolutional Neural Network (CNN) composed of:

– a convolutional layer with 16 filters of 3×3 px. A convolutional layer applies a convolution by moving a small filter kernel over the image;
– another convolutional layer with 64 filters, followed by a MaxPooling layer of 2×2 px. MaxPooling reduces the amount of information by only picking the maximum value inside the filter window;
– a dropout layer which disables 25 % of the nodes during training, to prevent overfitting and improve generalization;
– a flattening layer, which transforms the 2D input to a 1D output, which is then passed to a 128-node fully-connected layer;
– another dropout layer with a 50 % rate; and
– a single node with a sigmoid activation function to output a probability.
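The flow of a single 20×20 px patch through this layer stack can be traced with a toy numpy forward pass. Only the shapes matter here: the filter values are random, the dropout layers (active only during training) are omitted, and all helper names and the exact padding convention ("valid" convolutions) are assumptions of this sketch, not necessarily those of the actual Keras model.

```python
import numpy as np

def conv2d(x, n_filters, k=3):
    """Valid 2D convolution with random filters, followed by ReLU."""
    h, w, c = x.shape
    filt = np.random.default_rng(0).normal(size=(k, k, c, n_filters)) * 0.1
    out = np.zeros((h - k + 1, w - k + 1, n_filters))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.tensordot(x[i:i + k, j:j + k], filt, axes=3)
    return np.maximum(out, 0)                      # ReLU activation

def maxpool(x, p=2):
    """2x2 max pooling: keep the maximum value in each window."""
    h, w, c = x.shape
    return x[:h // p * p, :w // p * p].reshape(h // p, p, w // p, p, c).max(axis=(1, 3))

patch = np.random.default_rng(1).normal(size=(20, 20, 1))   # one input patch
x = conv2d(patch, 16)              # conv, 16 filters  -> (18, 18, 16)
x = conv2d(x, 64)                  # conv, 64 filters  -> (16, 16, 64)
x = maxpool(x)                     # 2x2 max pooling   -> (8, 8, 64)
flat = x.ravel()                   # flatten to 4096 values
w_fc = np.random.default_rng(2).normal(size=(flat.size, 128)) * 0.01
hidden = np.maximum(flat @ w_fc, 0)                 # 128-node dense layer
prob = 1 / (1 + np.exp(-hidden.sum() * 0.01))       # single sigmoid output node
```

With trained (instead of random) weights, `prob` is the class-c+ probability that SODINN-pairwise records for the corresponding pixel.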

prediction. After successfully training the SODINN-pairwise model it can be applied to the entire data cube for prediction. For every pixel in a frame a patch is created, in the same fashion as it was done with the training patches, but by following the positive direction of rotation for the PAs. The SODINN-pairwise model then calculates how probable it is for the patch to belong to the c+ class. That probability is recorded for every pixel and leads to a probability map, as can be seen in Figure 2.5a.


Figure 2.4 – Signal patches for SODINN-pairwise. The patches were obtained by injecting fake planets into the data cube and taking pairwise differences of the frames.

Figure 2.5 – SODINN-pairwise likelihood map and binary maps at different threshold levels. Figure 2.5a shows the probability map of the SODINN-pairwise algorithm after processing the ADI sequence in Figure 1.2. Figure 2.5b and Figure 2.5c show the binary detection maps at τ = 0.1 and 0.2, respectively. A higher threshold would result in no false positives (and 1 detection up to τ = 0.99). Note how the satellite spots are smeared out and not removed completely.


Chapter 3

Metrics

3.1 ADI algorithm evaluation

SNR maps. Traditionally, ADI algorithms aim at reducing speckles and other noise to produce a final image like the one shown in Figure 2.3a for LOCI. From that reduced image, a SNR map is constructed, e.g. following the method of Mawet et al. (2014) (Figure 2.3b). That SNR map is then inspected and thresholded at a given value (e.g. SNR = 5 σ) to obtain a binary map of the detections (Figure 2.3c).

true positive rate. When assessing the detection power of an ADI algorithm, one usually injects fake planets with known flux and position, and lets the algorithm recover them. If the algorithm recovers the injected planet, that is a true positive (TP). A false positive (FP) is the case of background noise mistaken for a signal. A perfect ADI algorithm would recover all injected planets, but report no false positives. The number of true/false positives can be obtained by counting bright blobs of multiple connected pixels in a binary map: the binary map in Figure 2.5b shows 15 FPs for a threshold of τ = 0.1, but only 2 FPs for a higher threshold of τ = 0.2. Increasing the threshold would further decrease the number of FPs, but could also lead to a lost companion signal. When processing multiple injections, the true positive rate (TPR), or completeness, is the fraction of the recovered TPs over the total number of injections.
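Counting connected blobs and classifying them as TP or FP can be sketched with scipy's connected-component labelling (the function name and the distance tolerance are my own choices for this illustration):

```python
import numpy as np
from scipy.ndimage import label

def count_detections(prob_map, threshold, injection_xy, tolerance=2):
    """Threshold a probability/SNR map and count blobs of connected
    pixels; a blob near the known injection is a TP, the rest are FPs."""
    binary = prob_map > threshold
    labels, n_blobs = label(binary)
    tps, fps = 0, 0
    for blob in range(1, n_blobs + 1):
        ys, xs = np.nonzero(labels == blob)
        cy, cx = ys.mean(), xs.mean()               # blob centroid
        dist = np.hypot(cy - injection_xy[1], cx - injection_xy[0])
        if dist <= tolerance:
            tps += 1
        else:
            fps += 1
    return tps, fps
```

Repeating this over many injections and thresholds yields the TPR and FP counts from which the ROC curves in section 3.2 are built.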

contrast curves. Contrast curves are commonly used in the literature when assessing the quality of ADI algorithms. They show the contrast at 50 % completeness (TPR = 0.5) as a function of the angular separation. The TPR decreases with smaller separations, as the speckle noise becomes dominant in the central parts of the frame.


3.2 ROC curves

As the injection and recovery of fake planets is a binary classification problem, receiver operating characteristic (ROC) curves can be constructed using the TPR and FPs.

SODINN-pairwise. Figure 3.1 shows two ROC curves for the SODINN-pairwise algorithm, processing the same data set with two differently trained models. The lower curve (black squares) was trained with the ×0.1 dataset, which consists of 954 noise and 954 signal samples (a 10 % sub-sampling of the total pairwise frames). The ×10 dataset contains 2×95 400 samples, obtained by sampling the available pairwise frames multiple times. The ROC curves show that the ×10 training is more effective in two respects:

– TPR. The maximum TPR at 0 FPs is higher. Without false positives, ×10 is able to recover 60 % of the injected companions, while ×0.1 only recovers about 30 %. The same holds for a reasonable threshold like τ = 0.9.
– FPs. At lower thresholds, ×10 claims fewer false positives than ×0.1. Also, for τ = 0.3, the TPR is the same for both models, but ×0.1 leads to more FPs.

ANDROMEDA. Figure 3.2 shows a different set of injections, this time processed with ANDROMEDA and the classic ADI algorithms Median Subtraction and PCA. The performance of ANDROMEDA is comparable to that of PCA, although this should be considered with caution, as both algorithms could be tuned further (by choosing a higher number of principal components for PCA, and by adapting the optimization parameters for ANDROMEDA). A deeper comparison has yet to be performed. Different threshold values were needed for the algorithms, as the ANDROMEDA algorithm generally outputs higher SNR values (leading e.g. to more FPs at 5 σ). When producing one single binary map, the choice of a threshold is difficult, and a profound knowledge of the algorithm is needed; a ROC curve avoids this problem by showing values at multiple thresholds.
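Constructing the ROC points themselves is a simple threshold sweep. The sketch below is a simplified stand-in that reduces each injection and each noise blob to a scalar map value; the real procedure counts connected blobs in binary maps, and the function name is my own.

```python
import numpy as np

def roc_curve_points(injection_scores, noise_scores, thresholds):
    """For each threshold: TPR = fraction of injected companions whose
    map value exceeds it; FPs = number of noise blobs that also do."""
    injection_scores = np.asarray(injection_scores)
    noise_scores = np.asarray(noise_scores)
    points = []
    for tau in thresholds:
        tpr = np.mean(injection_scores > tau)       # completeness
        fps = int(np.sum(noise_scores > tau))       # false positive count
        points.append((fps, tpr))
    return points
```

Plotting TPR against FPs over the threshold sweep gives one curve per algorithm (or per trained model), making the comparison independent of any single threshold choice.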


Figure 3.1 – ROC curves of SODINN-pairwise models with different amounts of training samples. Two different models of SODINN-pairwise were trained: ×0.1 with 2×954 samples, ×10 with 2×95 400. The model with the increased number of training samples has a better recovery of companions (∼60 % at 0 FPs instead of ∼30 %), highlighting the importance of a large training dataset.


Figure 3.2 – ROC curves of ANDROMEDA and full-frame PCA. ANDROMEDA and full-frame PCA are roughly on the same level, both outperforming the original Median Subtraction. Full-frame PCA was calculated with 20 principal components, while the various parameters of ANDROMEDA were fine-tuned by hand.


Chapter 4

Data Augmentation

While we can create a nearly arbitrary amount of c+ patches by injecting fake planets at different initial angles, radial distances and fluxes, the number of c− patches is limited by the noise which is already present in the data cube. One of the main factors determining the performance of a NN is the number of training samples; a network trained with too few samples does not generalize well. Figure 3.1 shows the performance of the SODINN-pairwise algorithm with different amounts of training data. While the ROC curves are explained in detail in section 3.2, it can already be seen that the network performs better after training with more samples.

Data augmentation. Data augmentation is a technique which increases the amount of training data using only information already present in the training data itself. In the field of computer vision and pattern recognition, various NNs have been adapted to the task of generating new image data. In particular, Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are generative models which have shown promise. As a large number of training samples is required to fully exploit the power of SODINN-pairwise, I studied the adaptation of these generative models to our need for generating more noise patches.


4.1 Variational Autoencoders (VAEs)

AE. Autoencoders (AEs) are a class of NNs which consist of two parts:

– encoder. The encoder takes an input image Iin (from the image space) and reduces it to a small number of parameters p1, p2, … in the lower-dimensional latent space. This is achieved by layers of decreasing size.

– decoder. The decoder has the inverse task: it takes parameters from the latent space as input and builds an output image Iout. Its structure is usually the inverse of the encoder.

The full autoencoder chains the two parts:

input image Iin → (encoder) → latent space p1, p2, … → (decoder) → output image Iout

The key is that the network works in an unsupervised fashion: an AE receives an unlabelled set of images and passes them through the network. It then compares each output image to the corresponding input and tries to reduce their distance in further passes. When the output images match the input images, the representation in the latent space is sufficient to describe all images with just a few parameters: with the latent parameters and the decoder part of the AE, we can reconstruct the initial image.
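To make this training loop concrete, here is a toy linear autoencoder in plain numpy, with 4-pixel "images" and a 2-parameter latent space. The layer sizes, learning rate and squared-error loss are illustrative only; the actual models in this thesis are built with Keras dense layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear AE: 4-pixel input, 2-dimensional latent space.
W_enc = rng.normal(0, 0.1, (2, 4))  # encoder weights
W_dec = rng.normal(0, 0.1, (4, 2))  # decoder weights

def forward(x):
    p = W_enc @ x         # encoder: image -> latent parameters
    return p, W_dec @ p   # decoder: latent -> reconstructed image

def train_step(x, lr=0.05):
    """One gradient-descent step on the squared reconstruction error."""
    global W_enc, W_dec
    p, x_out = forward(x)
    err = x_out - x                            # reconstruction error
    W_dec -= lr * np.outer(err, p)             # backprop through decoder
    W_enc -= lr * np.outer(W_dec.T @ err, x)   # ...and through encoder
    return float(np.sum(err ** 2))
```

Repeated calls to `train_step` on the same patch drive the reconstruction error toward zero, at which point the two latent parameters suffice to describe the input.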

VAE. The Variational Autoencoder (VAE) was proposed by Kingma and Welling (2013) as a variant of the AE which is specially adapted for the generation of images. While we could in theory sample the latent space of an AE, in practice this is not possible, as the distribution of the latent space parameters pn is not known. The VAE adds a constraint which forces the parameters in the latent space to follow a normal distribution. This is ensured by an adapted loss function composed of two terms:

– As for the AE, the output image needs to match the input image as closely as possible. This can be described as a binary classification problem (output matches / does not match the input), thus a binary crossentropy loss was chosen.

– Additionally, the latent space parameters pn should follow a normal distribution. A Kullback-Leibler divergence term acts as a regularizer.

For the network layout we chose 2 latent parameters, and one dense layer with 100 nodes in both the encoder and the decoder. The full network is shown in Figure 4.1.
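A minimal standalone sketch of this two-term loss, assuming input pixels normalized to [0, 1] (the trained model uses the equivalent Keras built-ins; this version is for illustration only):

```python
import numpy as np

def vae_loss(x, x_rec, p_mean, p_log_var, eps=1e-7):
    """Two-term VAE loss: binary crossentropy between input x and
    reconstruction x_rec, plus the Kullback-Leibler divergence of
    N(p_mean, exp(p_log_var)) against the standard normal prior."""
    x_rec = np.clip(x_rec, eps, 1 - eps)  # avoid log(0)
    bce = -np.sum(x * np.log(x_rec) + (1 - x) * np.log(1 - x_rec))
    kl = -0.5 * np.sum(1 + p_log_var - p_mean**2 - np.exp(p_log_var))
    return bce + kl
```

The KL term vanishes exactly when the latent distribution is the standard normal (p_mean = 0, p_log_var = 0) and grows as the encoder drifts away from it, which is what keeps the latent space sampleable.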


Sampling. The encoder network outputs two latent parameters, pmean and plog_var, representing the mean and the logarithm of the variance of the latent distribution. These two parameters are then used to sample new points from the latent space:

p = pmean + exp(½ plog_var) × ε,

where ε is sampled from a standard normal distribution. With this approach it is possible to separate the random sampling from the implementation of the model itself. Backpropagation would not be possible otherwise, as the random sampling operation is not differentiable. This is called the reparametrization trick.
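In code, the trick amounts to keeping ε as an external, non-learned input, so that the expression stays differentiable in pmean and plog_var (a sketch, not the Keras sampling layer itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(p_mean, p_log_var):
    """Reparametrization trick: p = p_mean + exp(0.5 * p_log_var) * eps.

    All randomness lives in eps; the mapping from (p_mean, p_log_var)
    to p is deterministic and differentiable."""
    eps = rng.standard_normal(np.shape(p_mean))
    return p_mean + np.exp(0.5 * p_log_var) * eps
```

For plog_var → −∞ the samples collapse onto pmean; for plog_var = 0 they are spread with unit variance around it.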

4.2 Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have become popular lately, as the generated samples are sharper compared to approaches like the VAE. GANs also consist of two sub-networks, but with a stronger interplay (see Figure 4.2):

– generator. The generator G obtains Gaussian noise patches as input, which it transforms using 2D-convolutional and MaxPooling layers to generate fake data.

– discriminator. The discriminator D is fed with fake data from G and real samples from the training data set. Its task is to distinguish these two classes; its output is either 'real' or 'fake'.

While G tries to fool D by generating fake images that look like real ones (our final goal), D tries to identify the fake images. As long as both networks learn at the same pace, both can improve. If the equilibrium is no longer maintained and one network gets significantly stronger, all attempts of the other one are immediately blocked: if D is too strong, G has no way of knowing which specific features D requires; if G is stronger, D has no indication, as all generated images look perfectly valid.
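One alternating training step can be sketched as follows. The callables `generate`, `train_discriminator` and `train_generator` are placeholders standing in for the Keras model calls (predict and train-on-batch), not the actual implementation; the label convention (1 = real, 0 = fake) and the label flip for the generator step are the essential idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def gan_training_step(real_batch, generate, train_discriminator, train_generator):
    """One alternating GAN step.

    1) Train D on a half-real/half-fake batch, labelled 1/0.
    2) Train G through the (frozen) D with the labels flipped to 1,
       so G is rewarded for producing samples that D calls 'real'.
    """
    n = len(real_batch)
    noise = rng.standard_normal((n,) + np.shape(real_batch)[1:])
    fake_batch = generate(noise)
    x = np.concatenate([real_batch, fake_batch])
    y = np.concatenate([np.ones(n), np.zeros(n)])   # 1 = real, 0 = fake
    d_loss = train_discriminator(x, y)
    g_loss = train_generator(noise, np.ones(n))     # G wants D to say 'real'
    return d_loss, g_loss
```

Monitoring `d_loss` and `g_loss` over the epochs produces curves like those in Figure 4.6, and makes it visible when one sub-network starts to dominate.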


Figure 4.1 – Operational sequence of the VAE. The random normal sampling is only part of the forward pass, and not of the backpropagation step.

Figure 4.2 – Operational sequence of the GAN. The discriminator differentiates between the real samples from the data cube and the generated ones. If both the generator G and the discriminator D learn at the same pace, the model stabilizes.

4.3 Noise Generation Results

Training data. To train the networks, noise patches were obtained by sampling the data cube as described for SODINN-pairwise in section 2.2, then normalized to unity. A small set of 954 noise patches of 20×20 px was chosen, to minimize their overlap and correlation, and to challenge the generating power of the models.
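The exact normalization scheme is not spelled out above; assuming a per-patch min-max scaling to [0, 1] (which is what the binary crossentropy reconstruction loss expects), a sketch could be:

```python
import numpy as np

def normalize_patches(patches):
    """Scale each patch independently to the [0, 1] range.

    `patches` has shape (n_patches, height, width); flat patches
    (max == min) would divide by zero and are assumed not to occur
    in real speckle noise.
    """
    patches = np.asarray(patches, dtype=float)
    lo = patches.min(axis=(1, 2), keepdims=True)
    hi = patches.max(axis=(1, 2), keepdims=True)
    return (patches - lo) / (hi - lo)
```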

VAE. The VAE was fast to set up, and training was straightforward. After 300 epochs with 20 patches per batch, which took about 1 min, the model no longer improves and produces the patches shown in Figure 4.4. While replicating the different structures of the real noise, many samples are similar to each other.

GAN. The main problem with the GAN is that the weights of the generator do not always converge. Both convergence and divergence happen abruptly after ∼100 epochs, when training with a batch size of 40 samples. Figure 4.6 shows the evolution of the generator and discriminator losses over time, for a case where the training was successful. Figure 4.5 shows some of the GAN-generated patches; compared to the VAE, these patches are more homogeneous, and seem to replicate all structures (both the speckle-dominated inner regions and the fine-grained outer regions) at the same time.


Figure 4.3 – Noise patches for SODINN-pairwise, obtained from the data cube. The extracted patches show a large variety in the form of the noise. These patches were used both for training the SODINN-pairwise model in section 2.2 and the generative models in chapter 4.

Figure 4.4 – VAE-generated noise patches. The VAE replicates both the big and small structures from the noise data in Figure 4.3, although the small structures seem to be smeared out, as expected from the VAE.

Figure 4.5 – GAN-generated noise patches. The GAN seems to mix the blobs from the inner annuli of the frames with the random noise from the outer regions.

Figure 4.6 – Training loss of the GAN. Training with 954 samples, and 40 samples per batch. It can be seen that the discriminator and the generator act as adversaries, as one sub-network's loss drops while the other one's rises, until they find an equilibrium.


Chapter 5

Conclusions and Outlook

Conclusions. During the work on this thesis I had the opportunity to manipulate, improve and test different ADI algorithms. The re-implementation of ANDROMEDA in python and its integration into the framework provided by VIP made it possible to compare ANDROMEDA to other ADI algorithms. The dense information content of the ROC curves highlighted the strengths of ANDROMEDA's pattern matching approach, but also the importance of manually tuned parameters. The new SODINN-pairwise algorithm follows a similar approach to ANDROMEDA, but uses ML classification instead of a maximum likelihood estimator. The training phase of SODINN-pairwise is crucial, and poses difficulties when training data is limited, as is the case for the noise samples we can obtain from a single data cube. I proposed two data augmentation techniques which are able to generate more training samples. Following the idea of reproducible science, various contributions were made to open source software, like the python implementation of the ANDROMEDA algorithm, different extensions and corrections for VIP, and a helper library for the transition from IDL to python (Appendix C).

Future Work. With powerful telescopes and instruments producing more and more data, there is also a need for improved data analysis techniques. ML allows us to develop fast and scalable algorithms that handle these large amounts of data and integrate with the time constraints of astrophysical observations. Following the Double Master's program of the Karlsruher Institut für Technologie and the Université Grenoble Alpes, I have the opportunity to pursue the topics presented in this thesis for 9 more months, allowing me to further deepen my knowledge, improve the current results and explore unanswered questions. The generative networks presented in chapter 4 look promising, but further investigation is needed to quantitatively assess the quality of the generated samples and address


their shortcomings. The computer science community has proposed various variants of the networks I used, like the DeLiGAN (Gurumurthy et al. 2017), which provides stability in training and diverse generated samples, but also entirely new techniques for data augmentation, like the SamplePairing technique (Inoue 2018). The SODINN-pairwise algorithm itself also needs to be tuned, and extensively compared to the other approaches, including the original SODINN algorithm, which is still of unmatched performance. The idea of transfer learning needs to be explored: a technique where a network like SODINN-pairwise is pre-trained with more general data and then fine-tuned for each case. This could allow us to process new ADI data in near real time, by using a model which was trained in advance with other observations from the same instrument.


Appendix A

References

A.-M. Lagrange, D. Gratadour, G. Chauvin, T. Fusco et al. (2009): A probable giant planet imaged in the β Pictoris disk. VLT/NaCo deep L’-band imaging A&A 493, pp. L21–L25 doi: 10.1051/0004-6361:200811325

C. Marois, B. Macintosh, T. Barman, B. Zuckerman et al. (2008): Direct Imaging of Multiple Planets Orbiting the Star HR 8799 Science 322, p. 1348 doi: 10.1126/science.1166585

B. Macintosh, J. R. Graham, T. Barman, R. J. De Rosa et al. (2015): Discovery and spectroscopy of the young jovian planet 51 Eri b with the Gemini Planet Imager Science 350, pp. 64–67 doi: 10.1126/science.aac5891, arXiv: 1508.03084

G. Chauvin, S. Desidera, A.-M. Lagrange, A. Vigan et al. (2017): Discovery of a warm, dusty giant planet around HIP 65426 A&A 605, L9, p. L9 doi: 10.1051/0004-6361/201731152, arXiv: 1707.01413

C. Marois, D. Lafrenière, R. Doyon, B. Macintosh and D. Nadeau (2006): Angular Differential Imaging: A Powerful High-Contrast Imaging Technique ApJ 641, pp. 556–564 doi: 10.1086/500401


D. Lafrenière, C. Marois, R. Doyon, D. Nadeau and É. Artigau (2007): A New Algorithm for Point-Spread Function Subtraction in High-Contrast Imaging: A Demonstration with Angular Differential Imaging ApJ 660, pp. 770–780 doi: 10.1086/513180

R. Soummer, L. Pueyo and J. Larkin (2012): Detection and Characterization of Exoplanets and Disks Using Projections on Karhunen-Loève Eigenimages ApJL 755, L28, p. L28 doi: 10.1088/2041-8205/755/2/L28, arXiv: 1207.4197

A. Amara and S. P. Quanz (2012): PYNPOINT: an image processing package for finding exoplanets MNRAS 427, pp. 948–955 doi: 10.1111/j.1365-2966.2012.21918.x, arXiv: 1207.6637

C. A. Gomez Gonzalez, O. Absil and M. van Droogenbroeck (2018): Supervised detection of exoplanets in high-contrast imaging sequences ArXiv e-prints arXiv: 1712.02841

E. Jones, T. Oliphant, P. Peterson et al. (2001): SciPy: Open source scientific tools for Python url: http://www.scipy.org/

T. E. Oliphant (2015): Guide to NumPy 2nd, USA: CreateSpace Independent Publishing Platform isbn: 151730007X, 9781517300074 url: http://web.mit.edu/dvp/Public/numpybook.pdf

J. D. Hunter (2007): Matplotlib: A 2D Graphics Environment Computing in Science Engineering 9.3, pp. 90–95 issn: 1521-9615 doi: 10.1109/MCSE.2007.55


F. Chollet et al. (2015): Keras url: https://keras.io

M. Abadi, A. Agarwal, P. Barham, E. Brevdo et al. (2015): TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems url: https://www.tensorflow.org/

Ph. Delorme, N. Meunier, D. Albert, E. Lagadec et al. (2017): The SPHERE Data Center: a reference for high contrast imaging processing SF2A-2017: Proceedings of the Annual meeting of the French Society of Astronomy and Astrophysics ed. by C. Reylé, P. Di Matteo, F. Herpin, E. Lagadec et al., pp. 347–361 arXiv: 1712.06948

C. A. Gomez Gonzalez, O. Wertz, O. Absil, V. Christiaens et al. (2017): VIP: Vortex Image Processing Package for High-contrast Direct Imaging AJ 154, 7, p. 7 doi: 10.3847/1538-3881/aa73d7, arXiv: 1705.06184 , url: http://github.com/vortex-exoplanet/VIP

C. A. Gomez Gonzalez, O. Absil, P.-A. Absil, M. van Droogenbroeck et al. (2016): Low-rank plus sparse decomposition for exoplanet detection in direct-imaging ADI sequences: The LLSG algorithm A&A 589, A54, A54 doi: 10.1051/0004-6361/201527387, arXiv: 1602.08381

L. M. Mugnier, A. Cornia, J.-F. Sauvage, G. Rousset et al. (2009): Optimal method for exoplanet detection by angular differential imaging J. Opt. Soc. Am. A 26.6, pp. 1326–1334 doi: 10.1364/JOSAA.26.001326, url: http://josaa.osa.org/abstract.cfm?URI=josaa-26-6-1326

F. Cantalloube, D. Mouillet, L. M. Mugnier, J. Milli et al. (2015): Direct exoplanet detection and characterization using the ANDROMEDA method: Performance on VLT/NaCo data A&A 582, A89 doi: 10.1051/0004-6361/201425571, arXiv: 1508.06406


D. Mawet, J. Milli, Z. Wahhaj, D. Pelat et al. (2014): Fundamental Limitations of High Contrast Imaging Set by Small Sample Statistics ApJ 792, 97, p. 97 doi: 10.1088/0004-637X/792/2/97, arXiv: 1407.2247

D. P. Kingma and M. Welling (2013): Auto-Encoding Variational Bayes ArXiv e-prints arXiv: 1312.6114

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu et al. (2014): Generative Adversarial Networks ArXiv e-prints arXiv: 1406.2661

S. Gurumurthy, R. Kiran Sarvadevabhatla and V. Babu Radhakrishnan (2017): DeLiGAN: Generative Adversarial Networks for Diverse and Limited Data ArXiv e-prints arXiv: 1706.02071

H. Inoue (2018): Data Augmentation by Pairing Samples for Images Classification ArXiv e-prints arXiv: 1801.02929


Appendix B

Acronyms

ADI Angular Differential Imaging.
AE Autoencoder. A type of Neural Network.
ANDROMEDA ANgular DiffeRential OptiMal Exoplanet Detection Algorithm.
AO Adaptive Optics.
CNN Convolutional Neural Network.
ConvLSTM Convolutional Long Short-Term Memory. Unit for a Recurrent Neural Network, remembering its state for a long period of time.
FP false positive.
FWHM full-width at half maximum.
GAN Generative Adversarial Network. A type of Neural Network.
GPI Gemini Planet Imager. HCI instrument on the 8 m Gemini South Telescope in Chile. First light in 2013.
IDL Interactive Data Language. A programming language for data analysis, especially used in astronomy.
IRDIS InfraRed Dualbeam Imager and Spectrograph. One of the instruments on SPHERE, besides IFS and ZIMPOL.
LLSG Local Low-rank plus Sparse plus Gaussian-noise decomposition. An ADI algorithm which uses an advanced sequence decomposition.
LOCI Locally Optimized Combination of Images. An ADI algorithm.
ML Machine Learning.
MLAR Multilevel Low-rank Approximation Residual.
NaCo NAOS-CONICA, short for Nasmyth Adaptive Optics System + COude Near Infrared CAmera. NIR (1 µm to 5 µm) instrument on VLT's UT4 from 2001–2013, reinstalled on UT1 in 2014.
NIR near-infrared. Wavelength ranging from ∼780 nm to ∼2.5 µm.
NN Neural Network.

PA parallactic angle.
PCA Principal Component Analysis.
PSF point spread function. Response of an optical system to a point source.
ROC receiver operating characteristic.
SNR signal-to-noise ratio.
SODINN Supervised exOplanet detection via Direct Imaging with deep Neural Networks. ADI algorithm which uses a Neural Network and MLAR patches.
SPHERE Spectro-Polarimetric High-Contrast Exoplanet REsearch. Visible/NIR instrument on VLT's UT3 telescope, featuring AO and coronographic systems for imaging, spectroscopy and polarimetry. First light in 2014.
SVD singular value decomposition.
TP true positive.
TPR true positive rate.
VAE Variational Autoencoder. A variant of the AE, optimized for data generation.
VLT Very Large Telescope. Telescope facility in Chile, with four 8.2 m and four 1.8 m telescopes (UT1–4 and AT1–4), operated by ESO since 1998.


Appendix C

Open Source Contributions

VIP / vip.andromeda. A large number of contributions was made to the public VIP repository on GitHub1 (see Figure C.1), most notably the development of a python implementation of the ANDROMEDA algorithm.

idlwrap. With numpy and scipy, powerful open-source tools are available for scientific computing in python. Still, many scientific projects — especially in astrophysics — rely on expensive licenses for the IDL programming language instead of moving to open and reproducible science. During the development of the vip.andromeda port, ∼50 IDL functions/routines and syntactic specialities of IDL were directly replicated in pure python, to automatically take care of e.g. the reversed memory order of matrices or array indexing. In a second step, these intermediary functions were replaced by direct calls to scipy. The wrapper functions around scipy were bundled in a python package, idlwrap, and published on the Python Package Index2. Together with an extensive guide to the core differences between IDL and python3, idlwrap aims to help other IDL users rapidly transition to python.

Figure C.1 – Contribution statistics to the VIP GitHub repository. During this thesis, a large number of contributions was made to the Vortex Image Processing python package, whose development takes place on the open source platform GitHub.

1 https://github.com/vortex-exoplanet/VIP
2 https://pypi.org/project/idlwrap
3 https://r4lv.github.io/idlwrap
