Carnegie Mellon University

Adaptive Cluster Detection

A Dissertation Submitted to the Graduate School in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Statistics

by

David Aaron Friedenberg

Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213

April 28, 2010

© 2010 by David Aaron Friedenberg

All rights reserved.

Abstract

The next generation of telescopes will acquire terabytes of image data on a nightly basis. Collectively, these large images will contain billions of interesting objects, which astronomers call sources. The astronomers' task is to construct a catalog detailing the coordinates and other properties of the sources. The source catalog is the primary data product for most telescopes and is an important input for testing new astrophysical theories, but to construct the catalog one must first detect the sources. Existing algorithms for catalog creation are effective at detecting sources, but do not have rigorous statistical error control. At the same time, there are several multiple testing procedures that provide rigorous error control, but they are not designed to detect sources that are aggregated over several pixels. We propose a family of techniques that do both, by providing rigorous statistical error control on the aggregate objects themselves rather than the pixels. We demonstrate the effectiveness of this approach on data from the Chandra X-ray Observatory Satellite. Our techniques effectively control the rate of false sources, yet still detect almost all of the sources detected by procedures that do not have such rigorous error control and have the advantage of additional data in the form of follow-up observations, which may not be available for upcoming large telescopes. In fact, we even detect two new sources that were missed by previous studies.

The statistical methods we develop can be extended to problems beyond Astronomy, as we will illustrate with examples from Neuroimaging. We examine a series of high-resolution functional Magnetic Resonance Imaging (fMRI) experiments in which the goal is to detect bands of neural activity in response to visual stimuli presented to subjects in an fMRI scanner. We extend the methods developed for Astronomy problems so that we can detect two distinct types of activation regions in the brain with a probabilistic guarantee on the rate of falsely detected active regions.

Additionally, we examine the more general field of clustering and develop a framework for clustering algorithms based around diffusion maps. Diffusion maps can be used to project high-dimensional data into a lower dimensional space while preserving much of the structure in the data. We demonstrate how diffusion maps can be used to solve clustering problems and examine the influence of tuning parameters on the results. We introduce two novel methods: the self-tuning diffusion map, which replaces the global scaling parameter in the typical diffusion map framework with a local scaling parameter, and an algorithm for automatically selecting tuning parameters based on a cross-validation style score called prediction strength. The methods are tested on several example datasets.

Acknowledgements

This dissertation marks the culmination of a great deal of work and would not have been possible without the help and friendship of a great number of people. I cannot possibly name them all, but would like to single out several important people for their positive impact in helping me get to this point in my career.

• Chris Genovese has shown infinite patience as my advisor and is always willing to help me solve whatever problems I may be having that week, as well as always offering sage advice when I need it most.

• Rebecca Nugent's passion and perspective remind me why I went to graduate school in the first place. She also deserves special mention for going out of her way to help me be engaged in the Statistics community outside of CMU, which she often reminds me does exist.

• Vasant Waikar and Ultan Killean – even though I'm not going into teaching, your examples inspired me to teach and are a big reason why I am here today.

• My fellow officemates and classmates throughout the years – whether it was long hours spent studying, playing games in the office, softball and hockey games, or just having someone to talk to, be it about Statistics or life, it is in large part due to you that I really did enjoy my six years in Pittsburgh.

• In no particular order, Larry Wasserman, Chad Schafer, Peter Freeman, Valérie Ventura, Steve Fienberg, Alessandro Rinaldo, Bob Noble, Ed Bolger, and Anthony Brockwell have all helped me on this journey through their teaching and willingness to meet, discuss, and dole out good advice whenever needed.

• The Statistics Department Office Staff – you guys are the best.

• Arthur Kosowsky, Eli Merriam, and David Heeger for allowing me to dabble in their worlds and have some really cool datasets to use in my work.

• Jen, you have actually put up with me for this long. I'm impressed. I'm so thankful that you're always there for me no matter what and that I can talk to you about anything – even serious things. I was wrong. There, I said it.

• Laurie, it's too bad we live so far apart because whenever we're together we always have so much fun. Maybe one day we can fix that?

• Mom and Dad, more than anyone on this list you deserve my thanks for everything you've done for me the past 28 years. I would have to write another dissertation to even approach recognizing all of the sacrifices and things you've done for Laurie and me. Now you can bug me about my progress on that dissertation since this one is done.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction

2 Cluster Based Multiple Testing Methods for Astronomical Source Detection
   2.1 Astronomical Source Detection
   2.2 Literature Review
   2.3 Overview of Contributions to Astronomical Source Detection
   2.4 The Data
   2.5 False Cluster Proportion Algorithm
       2.5.1 Calculating U: The PP Method
   2.6 FCP Applied to CDFS
   2.7 A General Method for Calculating the Confidence Superset

3 Enhancing False Cluster Proportion Methods for Astronomical Source Detection
   3.1 Introduction
   3.2 Using Multi-scale Derivatives to Enhance Detection Power
   3.3 FCP vs. MSFCP: Simulation Results
   3.4 MSFCP on Chandra Data
   3.5 Accounting for Varying Exposure Times
   3.6 Chandra 2Ms Exposure Time Application
   3.7 A Procedure for an Exposure-time Binned Confidence Supersets
       3.7.1 Exposure-time Binned Confidence Superset Example
   3.8 Conclusions

4 False Cluster Proportion Methods for Functional Neuroimaging Data
   4.1 Introduction
   4.2 Methods
       4.2.1 Classification
       4.2.2 False Cluster Proportion Algorithm for fMRI
       4.2.3 Calculating U
       4.2.4 Smoothing the p-values and calculating the corresponding Confidence Superset
   4.3 Results
   4.4 Discussion

5 Using Diffusion Maps to Detect Cluster Structure in Data
   5.1 Introduction
   5.2 Diffusion Maps
       5.2.1 Overview of Diffusion Maps
       5.2.2 Clustering in Diffusion Coordinates
   5.3 Fundamental Clustering Problems Suite
       5.3.1 Tetra Data
       5.3.2 Chain Link Data
       5.3.3 FCPS Results Summary
   5.4 Glass Identification Data
   5.5 Automatic parameter selection for Diffusion Clustering
       5.5.1 Self-Tuning Diffusion Maps
       5.5.2 Prediction Strength
       5.5.3 Diffusion Clustering using Prediction Strength to choose tuning parameters
       5.5.4 Visualizing Prediction Strengths Over a Grid of Parameter Choices
   5.6 Results of Diffusion Clustering using Prediction Strength
       5.6.1 Tetra Revisited
       5.6.2 Chain Link Revisited
       5.6.3 EngyTime Data
       5.6.4 Hepta Data
       5.6.5 Glass Revisited
   5.7 Conclusions

6 Discussion
   6.1 False Cluster Proportion Methods for Astronomical Source Detection
       6.1.1 Future work
   6.2 Enhancements to FCP methods for Astronomical Source Detection
       6.2.1 Future Work
   6.3 Detecting Cluster Structure in Functional Neuroimaging Data
       6.3.1 Future Work
   6.4 Detecting Cluster Structure using Diffusion Maps
       6.4.1 Future Work

Bibliography

List of Figures

2.1 Toy example of difference between pixel and cluster based detection
2.2 A patch of the CDFS
2.3 Simulated Smooth then Root on Poisson Data
2.4 CDFS data after applying FCP
2.5 Simulating p-values versus Piterbarg approximation
3.1 Toy Multi-scale Derivative Example
3.2 Multi-scale Derivative on SZ data
3.3 Boxplots of Detection Power
3.4 Modeling Detection Power as a Function of Source Intensity
3.5 CDFS data after applying MSFCP
3.6 New Detection using MSFCP on CDFS
3.7 Chandra 2Ms Raw Counts
3.8 Chandra 2Ms Data Exposure Time Map
3.9 Multi-scale Derivative Confidence Superset of 2Ms data
3.10 ETMSD image of 2Ms data
3.11 Confidence Superset for ETMSD image of 2Ms data
3.12 Small exposure times drive the confidence superset for the ETMSD image
3.13 Binned Confidence Superset for ETMSD image of 2Ms data
4.1 Graphical Summary of fMRI Experiment
4.2 Phase Classifier
4.3 5mm Data Results
4.4 2.5mm Data Results
4.5 1.6mm Data Results
5.1 Tetra Dataset Visual Summary
5.2 Chain Link dataset Visual Summary
5.3 Effect of tuning parameters on clustering solutions of Glass dataset
5.4 Visualizing prediction strength over a grid of tuning parameters
5.5 Choosing tuning parameters for Tetra dataset
5.6 Drop-off in Eigenmultiplier for Tetra Data
5.7 Choosing tuning parameters for Chain Link dataset
5.8 EngyTime Dataset
5.9 Choosing tuning parameters for EngyTime dataset
5.10 The Hepta dataset has 7 clusters in 3 dimensions
5.11 Histogram of ARIs for repeated runs of the Ng Method via its implementation in the kernlab package in R
5.12 Diffusion map representation of the Hepta dataset. Tuning parameters were chosen using Prediction Strength
5.13 Tuning parameter selection for Glass dataset

List of Tables

3.1 Mean Purities for Simulated Data
3.2 # of Times The Percentage of Falsely Detected Clusters Exceeded 10%
3.3 Summary of Detection Routines on CDFS
5.1 Comparison of clustering methods on FCPS datasets

Chapter 1

Introduction

The typical astronomical image records the intensity of light, over some range of frequencies, across a section of sky that contains many celestial objects of various size, shape, and luminosity. The image's pixels correspond to an array of light-sensitive detectors in the telescope, and each pixel essentially counts how many photons have struck the corresponding detector during the exposure. But the photons recorded in the image do not come solely from the objects of interest, or sources; thermal noise and background emissions (collectively called background) corrupt the data and obscure the signature of the objects. Moreover, diffraction and atmospheric effects blur the image, reducing resolution and washing out the fainter signals.

In the (astronomical) source detection problem, one is given such an image and seeks to construct a catalog that gives the coordinates (and often other properties) of sources in the image. A source catalog is the basic data product of most astronomical surveys and the basic input to the scientific process. This has been true for some time. Early catalogs – from the data of ancient astronomers Shi Shen and Hipparchus, each cataloging about 1000 stars, to the compendium of deep-sky objects produced by William and Caroline Herschel in the 1700s (Herschel, 1786) – were based on direct visual observations. Later work, especially in the 20th century, used photographic plates, both improving resolution and allowing the detection of much fainter objects. But either way, compiling a source catalog was a slow and painstaking affair, often requiring years to collect data on only a handful of objects. Until recently, catalogs comprising a few hundred objects were large, and a few thousand were epic.

All this changed with the advent of new technologies – digital imaging, advanced designs for telescope mirrors, and computer automation – and with increases in available computing power and storage. With relative suddenness, astronomers found that they could observe wider, deeper, and faster than ever before. They could sweep the sky searching automatically for objects of a certain type, they could collect data on many objects in parallel, and they could observe a multitude of faint objects that would previously have gone undetected. The Sloan Digital Sky Survey (York et al., 2000) has measured hundreds of millions of objects. The upcoming Large Synoptic Survey Telescope (LSST; Tyson & the LSST Collaboration, 2002) will scan the entire sky every few days, collecting several terabytes of data per night into a catalog comprising billions of objects. Over the past two decades, astronomy has gone from data poor to data rich.

And therein lies both opportunity and challenge. The opportunity lies in the richness and importance of the scientific questions that these massive data sets can answer. The challenge lies in the sheer scale of the data analysis. For example, the next generation of astronomical catalogs, including the LSST, will be so large that even simple operations – such as a basic query of the entire catalog – will be computationally prohibitive, even after accounting for Moore's law. While as a general rule in science more data is better, there is a reason that astronomers often describe the coming bounty of data in quasi-biblical terms – a flood, a tsunami, an onslaught.

The deluge of data is not a phenomenon unique to Astronomy; another field that is experiencing a wealth of new data is Neuroscience. With the advent of new non-invasive brain imaging techniques, neuroscientists are able to create high-resolution, three-dimensional maps of the brain. One of the most popular methods is functional magnetic resonance imaging (fMRI), which measures subtle changes in blood flow in the brain due to neuronal activity. fMRI offers a balance between spatial resolution, which allows activity to be localized to a specific area of the brain, and temporal resolution, which allows us to monitor these changes over time. Typical fMRI experiments record activity in hundreds of thousands of voxels, uniformly spaced volume elements that split the brain into equally sized cubes, usually about 3mm per side. Each voxel is recorded at regular intervals for the duration of the experiment, so for each voxel there is a time series of activation patterns. Typical temporal resolution can range from 0.5 to 4.0 seconds, so the time series for each voxel may consist of thousands of observations. Furthermore, the experiment may be repeated several times using different experimental conditions or different subjects (for more detail on fMRI, see Genovese (2000) and Lindquist (2008)). In many fMRI experiments, including the ones discussed in this document, the goal is to extract the clusters of voxels that activate in response to a presented stimulus.

Astronomy and Neuroscience are but two examples of fields where the ability to collect massive, scientifically rich datasets is threatening to outpace our ability to analyze them. The theme of this dissertation is taking such datasets and extracting the important cluster structure in a way that both answers the scientific questions at hand and preserves rigorous statistical bounds on the error rates most relevant to scientists. In Chapters 2 and 3, we provide a review of the astronomical source detection problem and propose a family of multiple-testing methods which are competitive with existing methods for creating source catalogs but, we believe, superior for tunable control of the relevant error rate. In Chapter 4, we extend the previously detailed methods for an important application in high-resolution fMRI data analysis. Chapter 5 examines methods for finding cluster structure in high-dimensional data and introduces new clustering methods based around the diffusion map data transformation (Lafon & Lee (2006) and Coifman & Lafon (2006)). The main contributions of this dissertation are summarized as follows:

Summary of Main Contributions

Chapter 2:

• Demonstrate how False Cluster Proportion (FCP) algorithms can be used for the astronomical source detection problem
• Develop a strength of detection measure for individual clusters detected using FCP
• Introduce the smooth-then-root method for normalizing low-count Poisson data
• Introduce a simulation procedure extending FCP to allow a wider variety of noise models

Chapter 3:

• Introduce the Multi-scale derivative transform for enhancing sources in an image
• Introduce the MSFCP algorithm for source detection which combines the power of the Multi-scale derivative transform with the error control of FCP
• Present a new analysis of part of the Chandra Deep Field South
• Detect two new sources that were missed in previous analyses of the Chandra datasets
• Extend the Multi-scale derivative transform to handle varying exposure times
• Develop a procedure for running FCP using exposure-time bins as a means to increase power

Chapter 4:

• Introduce procedures for analyzing fMRI experiments with control over the rate of falsely detected regions of brain activity
• Extend FCP to handle multiple classes of clusters
• Extend FCP to points that are not on a regular grid
• Present successful results of applying these methods to three fMRI experiments

Chapter 5:

• Demonstrate the power of diffusion maps to detect clustered structure in datasets
• Explore the influence of diffusion map tuning parameters in clustering problems
• Introduce self-tuning diffusion maps which replace the diffusion map's global tuning parameter with a series of local parameters, enhancing structure that may occur at different scales
• Present a new algorithm for choosing diffusion map tuning parameters in clustering problems utilizing a cross-validation style score

Chapter 2

Cluster Based Multiple Testing Methods for Astronomical Source Detection

2.1 Astronomical Source Detection

New telescopes that will capture terabytes of image data every night are expected to catalog billions of astronomical sources. Besides the sheer size of the data, another serious issue is controlling the rate of errors in the catalog. In the past, when every object in a catalog was observed manually, sources were often missed, located incorrectly, or created spuriously. To mitigate these types of errors, follow-up observations can often be made for all or most of the sources, reducing false positive and false negative identifications to a manageable level. But in the near future, the massive number of objects in the catalog will preclude comprehensive follow-up observations by human beings. Scientific studies of the catalog will likely need to be based on samples or selections made by automatic, statistical criteria. But no matter how carefully these criteria are constructed, there will be objects that are misclassified. To use the resulting samples effectively for scientific inference, it will be necessary for the method to provide tunable control of error rates. Thus, new statistical and computational methods will be needed to construct and analyze the next generation of astronomical source catalogs.

In this chapter, we propose a multiple-testing-based method for the source detection problem that has several advantages over existing techniques, especially for the analysis of large-scale surveys like the LSST. Although we discuss the method in the context of astronomical source detection, the methods developed in this chapter can be adapted and applied to a wide range of similar problems such as neuroimaging (e.g. Buckner, 1998) and remote sensing (e.g. Richards & Jia, 1999); and we give such an example in a later chapter.

We assume that the input image is an n × m array of pixels, with the value recorded at pixel (i, j) denoted by Yij. The photons that contribute to Yij arise from two components: sources, the emissions produced by the celestial objects of interest, and background, which includes thermal noise, the emissions of unresolved objects, interfering radiation sources, atmospheric emissions, and all other anomalies or artifacts. Astronomical images essentially measure photon counts, for which a Poisson model is appropriate (Cash, 1979). So for our base model, we assume that the Yij are independent, with

Yij ∼ Poisson(λ1,ij + λ0,ij),    (2.1)

where λ1,ij ≥ 0 and λ0,ij ≥ 0 denote the mean intensity of sources and background, respectively, and λ1,ij + λ0,ij > 0. The idea here is that the pixels are measuring the counts in disjoint cells of a Poisson random field across the sky. This applies to good approximation for space-based observations like those reported in Section 2.4. For ground-based observations, the image is in addition blurred by atmospheric turbulence, so the Yij are no longer strictly independent. However, in many cases, the Poisson model in Equation 2.1 still holds to reasonable approximation. In data sets where the counts are high, the Poisson random field can be further approximated by a Gaussian random field. In this chapter, we utilize a technique originally developed for the Gaussian model and generalize it so that we can accommodate a wide variety of models.

The source detection problem is to identify which pixels contain sources and to separate the sources from the background. If λ1,ij > 0, then we take pixel (i, j) to be a source pixel; otherwise, it is a background pixel. So it is natural to consider this as a multiple testing problem with the null hypothesis at each pixel being that λ1,ij = 0. At a coarse level, we want to characterize the set S = {(i, j) : λ1,ij > 0} of source pixels, but our more specific goal is to identify and locate the underlying sources, so that an accurate catalog can be constructed. This requires a more stringent criterion for success because the objects are coherent, localized aggregates. As Figure 2.1 shows, with the same number of pixel-wise type I and type II errors, it is possible to get widely varying accuracy in the resulting catalog. Put another way, our loss function operates on the catalog, not the pixels themselves. In this chapter and the next, we extend the False Cluster Proportion (FCP) controlling


Figure 2.1: The 45 pixels that have any overlap with the red circle are considered sources, and black pixels indicate sources detected via some detection algorithm. Each of the three images has 6 Type I error pixels and 24 Type II error pixels. The detection in the left image captures the center of the source but clearly misses the shape. The detections in the center image show several different sources that have some overlap with the true source. The detection in the right image captures the center of the true source and has some spurious noise detections. While all these pictures have the same number of pixel-wise Type I and Type II errors, they lead to very different conclusions about the number, shape and location of the sources in the image; thus a testing criterion based on the aggregate sources, instead of pixels, is necessary.

procedures introduced by Perone Pacifico et al. (2004) to make it effective for controlling the rate of false sources detected in astronomical images. As we will show, the original FCP procedure is not suitable for the Poisson statistics common in the source detection problem. Even when the assumptions of the procedure are met, it does not yield sufficient power to be viable for the astronomical source detection problem. We generalize the technique so that it applies to a wider range of noise models. We also introduce a new transform, which we call the Multi-scale Derivative, that enhances sources and significantly improves power. Taken together, these extensions lead to a new procedure that we call the Multiscale False Cluster Proportion (MSFCP) procedure. MSFCP is a powerful source detection technique that provides rigorous control over the rate of false sources, where techniques in current use provide control over the rate of false pixels, if they provide any control at all. We demonstrate both the excellent power and error control on a very deep and high resolution telescope image from the Chandra X-Ray Observatory. MSFCP has detection power competitive with the existing procedures, even detecting a source that was overlooked in previous studies while at the same time maintaining rigorous control over the rate of false


sources. Additionally, we show how these methods can be adapted to account for images with varying exposure times and demonstrate these techniques with another dataset from the Chandra X-ray observatory.
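As an illustration of the pixel-level model in Equation (2.1), the following is a minimal simulation sketch, written in Python for concreteness. The image size, the background rate of 0.3, and the Gaussian-shaped source intensities are illustrative assumptions, not values taken from the Chandra data or from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 128, 128                      # image dimensions (illustrative)
lambda0 = 0.3 * np.ones((n, m))      # constant background intensity (assumed value)

# Source intensity: a few small Gaussian blobs (purely illustrative shapes).
lambda1 = np.zeros((n, m))
yy, xx = np.mgrid[0:n, 0:m]
for cy, cx, amp in [(32, 40, 5.0), (90, 100, 3.0)]:
    lambda1 += amp * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 2.0 ** 2))

# Equation (2.1): independent Poisson counts with mean lambda1 + lambda0 at each pixel.
Y = rng.poisson(lambda1 + lambda0)

# True source pixels are those with lambda1 > 0 (here: where the blobs are non-negligible).
S = lambda1 > 1e-3
print(Y.shape, int(S.sum()), "source pixels")
```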

2.2 Literature Review

Due to its prevalence and difficulty, there have been a variety of approaches to the source detection problem, both in the statistical and astronomical literature. Given the recent explosion in research on multiple testing, there is a plethora of available techniques that can be applied directly to the pixel-wise hypothesis tests to reconstruct S, including Benjamini & Hochberg (1995)[BH], Benjamini et al. (2006), Storey (2002), Sun & Cai (2007), and Meinshausen & Rice (2006). However, because these methods give error-rates in terms of the individual pixel-wise tests, it is not obvious how to translate these error-rates to make inferences about the underlying sources. Multiple testing approaches that are less pixel-centered have been developed in the related problem of analyzing functional magnetic resonance imaging (fMRI) data. In this problem, the sources are regions of neural activity that reveal themselves through a measurable change in blood flow response. Worsley et al. (1996) and Worsley et al. (2002) use level sets of a random field formed from test statistics to identify regions containing sources. Heller et al. (2006) construct a test based on clusters instead of pixels in the fMRI setting, but this technique takes advantage of a temporal dimension that is not available in many general (e.g., astronomical) problems. Source detection understandably garners much attention in the astronomical literature. Astronomers address source detection as one step in a data-processing pipeline – the series of operations performed on the data from collection until catalog. These include, but are not limited to, corrections for atmospheric effects, image registration, filtering out unwanted signals, as well as source detection. These pipelines are typically planned and developed well before the instrument is operational. During the planning stage, simulations are run to test the pipeline including the source detection algorithm (e.g., Sehgal et al., 2007). Astronomers typically do not require their algorithms to satisfy any formal performance criteria; rather, they apply an empirical criterion, using data simulated to look like a real telescope image to calibrate the error rate for detected sources that will be expected in practice. Of course, this depends on the simulated and real data being both quantitatively and qualitatively


similar. While great effort and ingenuity are applied in constructing realistic simulations – sometimes years of computing time for a single run – the simulations still rely on untested and unstated assumptions that may fail when the instrument comes on line. The source detection methods used by astronomers fall into three general classes: simple thresholding, peak-finding algorithms, and Bayesian algorithms. Simple thresholding consists of choosing an intensity cutoff for pixel-wise statistics and classifying any pixel above threshold as a source. It is popular because it is simple and fast, easily computed by the popular SExtractor software (Bertin & Arnouts, 1996). Before thresholding, filters are often applied to the raw data to suppress confounding background signals. In Vikhlinin et al. (1995) and Melin et al. (2006), Matched Filters are used to isolate the signal, and then thresholds are determined using simulated data to create catalogs in X-ray and Radio telescope images respectively. Hopkins et al. (2002) also use simple thresholding, choosing the threshold to control the False Discovery Rate via the method of Benjamini & Hochberg (1995), and use simulated telescope images to calibrate between rate of false pixels and the rate of false sources. Peak-finding algorithms search for local maxima in the denoised image and catalog them as sources. Vale & White (2006) and Sehgal et al. (2007) use peak-finding algorithms to look for large galaxy clusters in simulated radio telescope images. The wavelet-based technique of Freeman et al. (2002) is commonly used to detect X-ray sources as in Valtchanov et al. (2001) and Giacconi et al. (2002). Damiani et al. (1997) and Gonzalez-Nuevo et al. (2006) provide good overviews of the popular Mexican Hat wavelet as a tool for source detection. Several implementations exist in software and most are specifically tailored for a certain instrument or type of problem. Pixel-wise error rates for wavelet source detection can sometimes be determined analytically but are more often approximated from simulations. As with simple thresholding, simulations are often used to indirectly estimate the rate of false sources, the error rate of interest, from the rate of false pixels. Bayesian techniques have become popular with astronomers and several have applied Bayesian methods to source detection as in Savage & Oliver (2007), Strong (2003), and Hobson & McLachlan (2003). Bayesian detection algorithms typically define models for sources, often two-dimensional Gaussians, and attempt to distinguish them from backgrounds via inference on a posterior distribution. Guglielmetti et al. (2009) use Bayesian mixture models to separate sources from background, while Carvalho et al. (2009) propose


ways to speed up Bayesian source detection, which can be computationally slow for large problems. Bayesian algorithms typically require more assumptions to be made a priori and can be more computationally expensive than thresholding or peak-finding algorithms.

2.3 Overview of Contributions to Astronomical Source Detection

Our goal in source detection is to detect the relevant objects, not pixels, while controlling the error rates. Statisticians have developed numerous methods for dealing with error rates but have not been focused on detecting aggregate objects, while astronomers have been thinking about detecting objects, but without a rigorous approach to controlling error rates. We propose a family of techniques that do both: control the error rates and make inferences about the sources themselves. We demonstrate our techniques on an important data set from the Chandra X-ray Observatory, one of the most powerful telescopes in existence. For the Chandra data, our goal is to detect the X-ray sources with a bound on the error rate for sources. We describe an approach due to Perone Pacifico et al. (2004) that gives us a probabilistic bound on the error rate for sources, and apply it to the Chandra data. We find that while this approach gives us the error control we want, it does not have good power when compared to other techniques. We introduce a generalization of the technique that allows us to keep a probabilistic bound on the error rate for sources under more general conditions. In Chapter 3, we introduce a new Multi-scale technique that is designed to enhance sources and thus increase power. We then integrate it with our generalized procedure to get the MSFCP procedure which increases power while maintaining control over the error rate. This improvement is evident when we revisit the Chandra data – we show that our power using MSFCP is competitive with two algorithms commonly used by astronomers, but with superior error control. Furthermore, using our procedure we detect two X-ray sources which had gone undetected in the original analysis of the data by astronomers. We conclude by examining how these procedures can be modified to account for variations in exposure time and demonstrate with another Chandra dataset.

2.4 The Data

We demonstrate our techniques on data from the Chandra X-ray observatory (Weisskopf et al., 2000), one of the most powerful X-ray telescopes available. Chandra orbits Earth, at approximately one third the distance to the Moon. Being outside of the Earth’s atmosphere allows for extended observing time without atmospheric disruption of the X-ray signals themselves. X-rays are emitted when matter is heated to millions of degrees, such as in the hot gas surrounding large galaxy clusters, during supernovae, or when matter circles a black hole. Resolving discrete sources of X-rays in space has been an important problem in Astronomy for several decades. Because studying X-ray emissions from galaxy clusters can answer many of the fundamental questions in Astronomy and Cosmology, several generations of instruments have been developed specifically for this task. From X-ray data, astronomers can infer the evolution of galaxies, which informs us about the evolution of the universe. For instance, recent X-ray observations of the Bullet Cluster have provided the most compelling evidence of dark matter (Markevitch et al., 2004), a mysterious form of matter that does not strongly interact with visible matter but is postulated to account for approximately 22% of the mass of the universe. The data we use comprise part of the Chandra Deep Field South (CDFS), a composite image of 11 Chandra observations of a small patch of sky with nearly one million seconds of observing time. This is a long exposure which means we should be able to resolve very distant and faint sources. This piece of sky was selected because it had low interference from our own galaxy and no bright stars in its vicinity. The CDFS has area approximately one half the angular size of the moon. It is also in a location that can be observed by several complementary ground-based telescopes. The part of the CDFS that we will analyze and discuss in this chapter can be seen in Figure 2.2. The first step in converting the raw image data into an actionable scientific dataset is cataloging all the sources in the image. The sources in the CDFS are thought to be active galaxies and quasars with massive black holes in their centers. We will provide a new analysis of this dataset that not only detects these important X-ray sources, but also makes a probabilistic guarantee on the rate of falsely detected sources. The original analysis of the CDFS was published in Giacconi et al. (2002) [GI]. Their main catalog was created by combining the catalogs from two different source detection


Figure 2.2: A 512 pixel by 512 pixel patch of the Chandra Deep Field South. The image was smoothed and then all pixels with count greater than 15 were set to 15. This image is the log of that image plus one. The transformations were performed so that the brightest sources do not wash out the fainter ones for display purposes. They were not used in any of the analysis. The complete data set is publicly available from http://cxc.cfa.harvard.edu/cda/


algorithms: a modified version of the SExtractor algorithm (Bertin & Arnouts, 1996), and the WAVDETECT algorithm (Freeman et al., 2002). The SExtractor algorithm estimates the background and catalogs as a source any region that is 2.4σ above the background that is also at least 5 pixels in area. These parameters are determined using images simulated to look like the real telescope images. They are selected to create a large catalog which will likely include several false sources. Independently, the WAVDETECT procedure was run using a wavelet transformation specifically tuned for this type of image to detect sources. Again, parameters are calibrated using simulated images. The two catalogs are merged and then refined using follow-up observations in an effort to remove sources that may be false. In this step, they are attempting to replicate the detection with independent observations, thus if an object is in the catalog we are fairly certain it is real since it has been observed multiple times, often with a completely different instrument. It is important to note that this detection strategy is designed for a small patch of sky with the resources to follow-up all potential detections. In this scenario, it makes sense to use a detection scheme that casts a wide net. Since we can go back and verify or reject each potential detection, we will tolerate a larger number of spurious sources initially in exchange for the ability to detect some fainter sources. This strategy will fail with newer telescopes that will be scanning large areas of sky and making so many detections that it will be impractical to follow-up each and every detection. Instead we want a strategy that will automatically detect real sources reliably while controlling the rate of false ones in one pass through the data.
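For orientation, here is a schematic sketch of the kind of threshold-plus-minimum-area rule described above (a region a fixed number of standard deviations above the estimated background and at least five pixels in area). It is a simplified, assumed stand-in, not the actual SExtractor or WAVDETECT implementation, and the smoothing step and crude background estimate are choices made purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

def threshold_detect(image, k=2.4, min_area=5, smooth_sigma=1.0):
    """Schematic k-sigma-above-background detection with a minimum-area cut."""
    smoothed = gaussian_filter(image.astype(float), smooth_sigma)
    bg_mean, bg_sd = np.median(smoothed), smoothed.std()   # crude background estimate
    mask = smoothed > bg_mean + k * bg_sd                   # pixels above the threshold
    labels, n_clusters = label(mask)                        # 4-connected components
    sources = []
    for i in range(1, n_clusters + 1):
        pix = np.argwhere(labels == i)
        if len(pix) >= min_area:                            # keep regions of >= min_area pixels
            sources.append(pix.mean(axis=0))                # catalog the centroid
    return np.array(sources)
```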

2.5 False Cluster Proportion Algorithm

A reasonable first approach for detection is to conduct a hypothesis test at each pixel individually. We want to test whether a pixel is a background pixel (null) or a source pixel (alternative). After conducting the appropriate pixel-wise test, a multiple testing correction can be applied, such as the BH method. A problem with the pixel-wise testing approach is that the units of inference are individual pixels. A pixel is an artificial unit related to the resolution of the instrument. What we are interested in are objects composed of collections of adjacent pixels. Therefore, we want the unit of inference to be a cluster as opposed to a pixel.

An alternative to pixel-wise testing is the False Cluster Proportion (FCP) procedure introduced by Perone Pacifico et al. (2004). The method is designed to bound the rate of false positive regions of a random field. FCP treats the image as a realization of an underlying random field and then derives a confidence superset for the location of the true nulls (background regions). Denote the unknown true nulls as the set S0 and the confidence superset as U. We call U a 1 − α confidence superset for S0 if

P(U ⊇ S0) ≥ 1 − α.    (2.2)

Let Lt denote the level set of all pixels that have intensity greater than t. Lt can be decomposed into its connected components. A connected component is a subset of pixels in Lt in which, for any two pixels in the component, there exists a path of contiguous pixels from one to the other in which every pixel in the path is also in the connected component. Thus, every pixel in Lt is grouped with all its neighboring pixels that are also in Lt, creating connected components with intensity greater than t. This decomposition yields a set Ct of kt connected components, Ct = {C1,t, C2,t, ..., Ckt,t}. We refer to each connected component as a cluster, and our goal is to decide which of these clusters correspond to the sources we want to detect and which clusters are simply noise. For any cluster C, the cluster is declared false if

λ(C ∩ S0) ≥ ε λ(C).    (2.3)

In the pixelized images we will deal with in this chapter, λ denotes the counting measure and ε is a pre-specified tolerance parameter. The goal is to come up with a threshold t that ensures the proportion of detected objects that are false is sufficiently low. Define the true false cluster proportion, Ξ(t):

Ξ(t) = #{1 ≤ i ≤ kt : λ(Ci,t ∩ S0)/λ(Ci,t) ≥ ε} / kt.    (2.4)

This quantity can be bounded by calculating the false cluster proportion envelope Ξ̄(t), where

Ξ̄(t) = #{1 ≤ i ≤ kt : λ(Ci,t ∩ U)/λ(Ci,t) ≥ ε} / kt.    (2.5)

Then, from Perone Pacifico et al. (2004),

P(Ξ(t) ≤ Ξ̄(t) for all t) ≥ 1 − α.    (2.6)

Suppose we have a 1 − α confidence superset U and want to bound the rate of false clusters to be less than 10%. Equation (2.6) tells us that if we can find the value t.1 such that Ξ̄(t.1) = 0.1, then the catalog containing the clusters in Ct.1 will have the property that less than 10% of them are false with probability 1 − α. In general, we want to find the value tc such that Ξ̄(tc) = c, where c is the false cluster proportion value which we do not wish to exceed. To do this, one must perform a search over the possible values of t. Once tc has been determined, a catalog is constructed using the ktc clusters from the level set Ltc as the detected objects. They have the property that, with probability 1 − α, the proportion of the detected objects which are false positive detections is less than or equal to c.

In certain instances it may be of interest how confident we are in a given detection. For this purpose, we introduce a strength of detection measure for each cluster. For a given cluster C, we look at its brightest pixel, denoted

cmax = max{Yij ∈ C},    (2.7)

and define

P(C) = Ξ̄(cmax).    (2.8)

Similar to how a p-value is the smallest level at which we can detect a significant difference from the null, P(C) is the smallest false cluster proportion tolerance at which the cluster would be detected as significantly different from the background. As with a p-value, P(C) can range between zero and one, with smaller values indicating stronger detections.
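To make the search for tc and the detection-strength measure concrete, here is a minimal sketch in Python. It assumes the image and a confidence superset U (given as a boolean pixel mask) are already available, uses 4-neighbour connected components via scipy.ndimage.label, and searches an illustrative grid of candidate thresholds; none of these choices come from the original implementation.

```python
import numpy as np
from scipy.ndimage import label

def fcp_envelope(image, U_mask, t, eps):
    """Xi-bar(t): proportion of clusters in the level set L_t whose overlap
    with the confidence superset U is at least eps times their area."""
    labels, k_t = label(image > t)             # connected components of L_t
    if k_t == 0:
        return 0.0, labels, k_t
    false = 0
    for i in range(1, k_t + 1):
        cluster = labels == i
        if U_mask[cluster].sum() >= eps * cluster.sum():
            false += 1
    return false / k_t, labels, k_t

def fcp_detect(image, U_mask, eps=0.99, c=0.10, thresholds=None):
    """Find the smallest threshold t_c on the grid with Xi-bar(t_c) <= c
    and return it along with the detected clusters."""
    if thresholds is None:                      # illustrative grid of candidate thresholds
        thresholds = np.quantile(image, np.linspace(0.90, 0.9999, 200))
    for t in thresholds:                        # search from low to high thresholds
        prop, labels, k_t = fcp_envelope(image, U_mask, t, eps)
        if prop <= c:
            clusters = [np.argwhere(labels == i) for i in range(1, k_t + 1)]
            return t, clusters
    return None, []

def detection_strength(image, U_mask, cluster_pixels, eps=0.99):
    """P(C) = Xi-bar evaluated at the cluster's brightest pixel value
    (a finer implementation would threshold just below c_max)."""
    c_max = image[tuple(cluster_pixels.T)].max()
    return fcp_envelope(image, U_mask, c_max, eps)[0]
```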

2.5.1 Calculating U: The PP Method

To use FCP one needs a confidence superset U which satisfies Equation (2.2). Perone Pacifico et al. calculate U by looking for the smallest set A for which the null of the following test cannot be rejected:

H0 : A ⊂ S0 versus H1 : A ⊄ S0.    (2.9)

In practice, the maximum pixel value in the set A is used as the test statistic. The p-value is then

p(x, A) = P0(max{Yij : (i, j) ∈ A} ≥ x).    (2.10)

U is calculated by starting with the entire image and then progressively removing the pixels with the highest intensity (those least likely to be part of the background) until the p-value of the remaining set exceeds α. The p-value in Equation 2.10 is calculated using the Piterbarg approximation (Piterbarg, 1996). The Piterbarg approximation assumes that the field is a locally stationary Gaussian random field with quadratic covariance. If X(s) is a homogeneous Gaussian random field, then X/σ is locally stationary with quadratic covariance if, for some matrix B,

ρ(s) = 1 − s^T B s + o(||s||^2).    (2.11)

This method yields a valid confidence superset U for the case of Gaussian Random Fields that satisfy Equation (2.11). The confidence superset can then be incorporated into FCP to find the appropriate value for tc. The clusters detected using the cutoff tc will, with high probability, have a false cluster rate less than the bound c. Henceforth we will refer to this approach as the PP method; for full details see Perone Pacifico et al. (2004).

Once we have run FCP, astronomers can plug the value tc in as an argument to the SExtractor software and generate the catalog without having to learn any new software and at the computational speed to which they are accustomed. Many astronomers use SExtractor for source detection either using the default settings or by "playing with various parameter combinations and inspecting the results by eye" (Shim et al., 2006). FCP gives astronomers a more principled way to choose the parameters in the software they are already using.
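A minimal sketch of the shrinking construction of U described above. The p_value argument is a user-supplied function returning P0(max over a set of the given size ≥ x) – for example the Piterbarg approximation or the simulation-based estimate of Section 2.7 – and looping over pixels one at a time is done only for clarity, not efficiency.

```python
import numpy as np

def confidence_superset_pp(image, p_value, alpha=0.05):
    """Greedy construction of U: remove the brightest pixels one at a time until
    the test H0: A subset of S0 can no longer be rejected at level alpha, then
    return the remaining set A as the confidence superset U (a boolean mask).

    p_value(x, n_pixels) must return P0(max of n_pixels null pixels >= x); it is
    a user-supplied function, e.g. a Piterbarg-style approximation or the
    empirical estimate of Section 2.7.
    """
    values = np.sort(image.ravel())[::-1]        # pixel values, brightest first
    n = values.size
    for k in range(n):                           # after removing the k brightest pixels...
        remaining_max = values[k]                # ...the test statistic is this maximum
        if p_value(remaining_max, n - k) > alpha:
            # Cannot reject H0: keep every pixel at or below the current maximum
            # (ties with removed pixels are ignored in this sketch).
            return image <= remaining_max
    return np.zeros_like(image, dtype=bool)      # degenerate case: everything removed
```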

2.6 FCP Applied to CDFS

In the CDFS data, the objects we are looking for are active galaxies and quasars which are emitting X-rays. We would like to detect as many of these objects as possible without making spurious detections, that is, classifying what is really just noise as an X-ray emitting object. Each pixel records the number of photons that strike the detector, and these counts are modeled as Poisson random variables. The mean background in these images is small, approximately 0.3, which violates the assumption of the PP procedure that the background is Gaussian. Our initial approach to dealing with the non-Gaussianity is to transform the data so that the Gaussianity assumption is more appropriate. Specifically, we first smooth the data and then apply a pixel-wise transformation. Smoothing the data with a Gaussian filter effectively aggregates counts, so we preserve most of the structure of the Poisson data even though they are no longer counts. We then take the square root transformation, which makes the background approximately Normal. The square root transformation has been shown to normalize high count Poisson data (Anscombe, 1948). Using simulations, we have tested the effectiveness of our smooth-then-square-root procedure for low rate Poisson data and found that for rates as low as 0.2, it yields approximately normal data. Simulation results for different rate parameters are shown in Figure 2.3.
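The smooth-then-square-root check is easy to reproduce; below is a small sketch in the spirit of the simulation behind Figure 2.3. The grid size, kernel width, and the set of background rates are illustrative assumptions, and simple moment diagnostics are used in place of the histograms and Quantile-Quantile plots in the figure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)

for lam in [0.1, 0.2, 0.5, 1.0]:                 # background rates to try (illustrative)
    counts = rng.poisson(lam, size=(100, 100))   # pure Poisson background, no sources
    z = np.sqrt(gaussian_filter(counts.astype(float), sigma=1.0))  # smooth, then root
    z = (z - z.mean()) / z.std()                 # standardize each pixel
    # Rough normality diagnostics: skewness and excess kurtosis should be near 0.
    print(f"lambda={lam}: skew={skew(z.ravel()):.2f}, "
          f"excess kurtosis={kurtosis(z.ravel()):.2f}")
```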

Figure 2.3: Simulation of a 100x100 image of Poisson background which is then smoothed with a Gaussian kernel with standard deviation of 1 pixel. The square root is then taken at each pixel. The histogram shows the pixels become approximately normal for values of lambda above 0.2. The Quantile-Quantile plots show the data compared to a perfectly normal distribution. The closer the points lie to the line y=x (green), the closer the data is to a normal distribution.

After the normalizing transformation, we standardize each pixel, assuming a constant background mean, and calculate the 95% confidence superset for the background using the PP method. We then run FCP on the transformed data as described in Section 2.5, bounding the rate of false sources to be less than 10%. Because our U is very conservative (it contains every pixel from the null with probability 0.95), we select ε = 0.99 to maximize our detection power. Nothing in our procedure relies on information about the telescope or knowing the nature or distribution of the sources themselves. We only assume that our normalizing transformation does indeed give us a background that is approximately Gaussian, an assumption which can easily be verified.

FCP, using the PP method to calculate U, detects 17 sources, all of which are in the previously published GI catalog, as seen in Figure 2.4. Since this catalog has been verified with follow-up observations, we are confident that all the detections are real. However, this analysis is clearly missing many clusters that should be detectable – nine of the 26 clusters in the Main catalog of GI were missed using FCP with the PP method. One way to boost the power is to run the data through an appropriate filter that will enhance the sources and help separate them from the background. However, to do that and maintain the desired control over the rate of false sources, we need to adapt FCP so that it can handle more varied noise conditions. This means moving away from the PP method to a more general procedure for calculating U. In the following section we propose a procedure for calculating U under any noise distribution that we can simulate. This will allow us to run FCP on filtered data and achieve higher power while still maintaining control over the rate of false sources.

2.7 A General Method for Calculating the Confidence Superset

The False Cluster Proportion Algorithm requires the derivation of a confidence superset for the background of an image. What we have been referring to as the PP method, which is based on the Piterbarg approximation, is limited to cases where pixels are or can be transformed to look like a realization of a Gaussian Random Field that satisfies Equation 2.11. But in many situations the data are not Gaussian and finding an acceptable transformation is infeasible. For example, filtering techniques commonly used to enhance sources will usually not satisfy the assumptions of the PP method. To broaden the scope of the technique, we have developed a simulation procedure that produces an accurate 1 − α confidence superset for any noise distribution from which we can sample numerically. Assume our data are a rectangular image with N total pixels with known noise distribution F. First, simulate noise images and calculate the maxima of subsets of the images. From the simulated data we build up empirical distributions for these maxima and use them to calculate the p-values which, in the PP method, were calculated


Figure 2.4: CDFS data after applying the False Cluster Proportion procedure using the PP method to calculate U. Black indicates sources detected by FCP and the circles are detections from both the Main Catalog and Secondary Catalogs from GI. There are 17 detections from FCP all of which are in the Catalog and 9 missed sources.


using the Piterbarg approximation. We can then plug the p-values into the FCP procedure to ensure the False Cluster Proportion guarantee (Equation 2.6) for a much wider variety of noise conditions than the PP method.

Algorithm: Simulating p-values

For each b ∈ {1, . . . , B}:

1. Simulate an image, Y^b, with the same dimensions as the data, with noise distribution F and no sources.

2. Record the maximum value of the image, Y^b_(N).

3. For k = 1, 2, . . . , a: randomly remove k pixels and record the maximum remaining pixel value Y^b_(N−k). Here N − a is the smallest number of pixels for which you want to calculate the p-value; it needs to be more than the number of pixels that you believe to be sources, but the number of computations increases with a.

In this way, we build the distribution of the maximum for a given area under the null. Then we can calculate the needed p-values

p(x, A) = P0(max{Yij : (i, j) ∈ A} ≥ x),    (2.12)

which were previously approximated using Piterbarg's formula and are now estimated using the simulated data:

p(x, A) ≈ #{1 ≤ b ≤ B : Y^b_(Area(A)) ≥ x} / B.    (2.13)

Then we proceed as in Section 2.5 to calculate U. For a Gaussian Random Field that satisfies Equation 2.11 we can compare the Piterbarg formula to our simulated p-values. In Figure 2.5 we simulate the p-values for different sized areas A, as described above and also by replacing Step 3 with removing a square region of area A from the image. We see that the p-value from Piterbarg, which accounts for size but not shape, tends to fall in between these two, but the differences are small and get smaller as the area gets bigger. For astronomical source detection, the random sampling approach is closest to the data we observe, so we use it as our default. However, it is worth noting that our simulation technique is easily adaptable to handle prior knowledge about the shapes of the objects of interest. In addition, unlike Piterbarg, our simulation procedure gives accurate p-values for any x, not just those in the tail. However, the biggest benefit of simulating the p-values is that we are not restricted to smooth Gaussian Random Fields and instead can handle a wider range of noise conditions.
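A compact sketch of the "Simulating p-values" algorithm and the empirical estimate in Equation (2.13). The noise model, grid size, and number of simulated images are illustrative assumptions; sample_noise stands in for whatever null distribution F one can sample from numerically.

```python
import numpy as np

def simulate_null_maxima(sample_noise, shape, B=1000, max_removed=500, rng=None):
    """Empirical null distribution of the maximum over N - k randomly kept pixels,
    for k = 0, ..., max_removed, following the 'Simulating p-values' algorithm."""
    if rng is None:
        rng = np.random.default_rng()
    N = shape[0] * shape[1]
    maxima = np.empty((B, max_removed + 1))
    for b in range(B):
        y = sample_noise(shape, rng).ravel()          # one source-free noise image
        kept = y[rng.permutation(N)]                  # random removal order
        running_max = np.maximum.accumulate(kept[::-1])[::-1]  # max of kept[k:]
        maxima[b, :] = running_max[:max_removed + 1]
    return maxima

def empirical_p_value(x, area, maxima, N):
    """Equation (2.13): fraction of simulated maxima over |A| = area null pixels >= x."""
    k = N - area                                      # number of pixels removed
    return np.mean(maxima[:, k] >= x)

# Example with an assumed standard-normal noise model on a 256 x 256 grid.
noise = lambda shape, rng: rng.standard_normal(shape)
maxima = simulate_null_maxima(noise, (256, 256), B=200, max_removed=1000)
print(empirical_p_value(3.5, 256 * 256 - 10, maxima, 256 * 256))
```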

Figure 2.5: P-values for a Gaussian Random field that satisfies Equation 2.11 for different sized regions. The dotted black line is the value from Piterbarg that does not consider any shape information, the solid green line is calculated using a simulation that randomly samples pixels, and the dotted red line uses the same simulation but the area sampled is a randomly selected square region of appropriate area. Shape information appears to have a mild effect that gets smaller as the area considered gets larger. Also, Piterbarg exhibits strange behavior for small areas. The simulation was run 10,000 times.

We can simplify the simulations further. We note that p(x, A) > p(x, A − k) for all k > 0. Since we are dealing with large images, where the source signal is sparse relative to the background, the maximum over N pixels is not going to be substantially different from the maximum over N − k pixels when N >> k. Thus if we use the following algorithm we will get a comparable confidence superset directly and with less computation.

Algorithm: Simulating a Confidence Superset

1. For each b ∈ {1, . . . , B}, simulate an image, Y^b, with the same dimensions as the data, with noise distribution F and no sources, then record its maximum
2. Calculate r1−α, the (1 − α) percentile of the maxima
3. Using the original data image, calculate the level set U = {Yij : Yij < r1−α}

Using the second algorithm is faster because we avoid the computation of the a · B maxima, at the cost of having a bigger confidence superset U. However, as long as N ≫ k, which will be the case for telescope images where sources are relatively sparse on the sky, the size difference is negligible. The second algorithm also allows us to calculate U directly without having to search over sets or approximate any p-values.
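A minimal sketch of this second algorithm under the same assumptions as before; the `sample_noise` callback is again a stand-in for drawing a noise-only image from F.

```python
import numpy as np

def confidence_superset(data, sample_noise, alpha=0.05, B=1000, rng=None):
    """Direct 1 - alpha confidence superset U: threshold the data at the
    (1 - alpha) quantile of simulated noise-only image maxima."""
    rng = np.random.default_rng(rng)
    maxima = np.array([sample_noise(data.shape, rng).max() for _ in range(B)])
    r = np.quantile(maxima, 1 - alpha)   # r_{1-alpha}
    return data < r                      # Boolean mask of pixels in U
```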

These simulations are only meant to approximate a confidence superset for a given null distribution – they are not meant to simulate the complex systems in the real data. Therefore we do not need to use prior knowledge or make assumptions about the underlying Astronomy, as in the simulations commonly used by astronomers for source detection. Our simulated confidence superset U can be calculated for any noise condition from which we can sample numerically. Furthermore, we can also simulate what happens if we apply filters to the data, as is commonly done in astronomy. Filters can often be used to enhance sources and remove background contamination, making detection easier. For instance, if we know our data have a Poisson noise distribution, we can simulate the Poisson noise image, apply the filter, and then calculate the maxima of the filtered image.


Algorithm: Simulating a Confidence Superset for Filtered Data

1. For each b ∈ {1, . . . , B}, simulate an image, Y^b, with the same dimensions as the data, with noise distribution F and no sources
2. Apply the filter to Y^b to get F(Y^b)
3. Record the maximum pixel value from F(Y^b)
4. Calculate r1−α, the (1 − α) percentile of the maxima
5. Using the filtered data image, F(Y), calculate the level set U = {F(Y)ij : F(Y)ij < r1−α}

We can then calculate the confidence superset for filtered Poisson data. This combination of filtering and false cluster proportion control gives us a very powerful tool for source detection.
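As an illustration, a sketch of the filtered-data version, assuming Poisson background noise with mean `lam0` and, purely as an example, a Gaussian filter from SciPy; dropping the filter recovers the second algorithm above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filtered_confidence_superset(image, lam0, alpha=0.05, B=1000, sigma=2.0, rng=None):
    """Simulate Poisson(lam0) noise images, apply the same filter used on the data,
    and threshold the filtered data at the (1 - alpha) quantile of the filtered maxima."""
    rng = np.random.default_rng(rng)
    maxima = np.array([
        gaussian_filter(rng.poisson(lam0, size=image.shape).astype(float), sigma).max()
        for _ in range(B)
    ])
    r = np.quantile(maxima, 1 - alpha)
    filtered = gaussian_filter(image.astype(float), sigma)
    return filtered < r   # Boolean mask: pixels in the confidence superset U
```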

Chapter 3

Enhancing False Cluster Proportion Methods for Astronomical Source Detection

3.1 Introduction

We have seen with the Chandra data that FCP, using the intensity at each pixel and the PP method, does not achieve power comparable to the methods in Giacconi et al. (2002)[GI]. In Section 2.7, we outlined a simulation-based approach which replaces the PP method, allowing for more general test statistics that do not necessarily have to be Gaussian. In this chapter, we describe a new technique which directly addresses two problems contributing to the lack of power seen in Section 2.6. The first problem is that rules based on the magnitude of a pixel are not necessarily the best at discriminating sources aggregated over pixels from background. We expect, due to random fluctuations, to get some individual background pixels with relatively large counts. Compared to rules based on the magnitude of pixels, rules that take into account local information can better differentiate between these occasional pixels with large counts and a cluster of pixels that all have large counts. The second problem is that we assumed a constant background mean. While this is often a reasonable assumption, there may be locations in the image where this does not hold; for example at the edge of exposures where certain pixels may have been exposed longer than others. In general, telescope images can have non-constant backgrounds that vary spatially, in which case the constant background mean assumption is clearly violated. In this chapter we develop a Multi-scale Derivative approach which uses local information to identify regions that stand out from the background as sources. This approach uses local information to detect clusters instead of looking at the magnitude of each pixel individually. Additionally, we develop methods for images with varying exposure times. This is a common problem in Astronomy which causes non-constant backgrounds and the potential to bias results. We show how to account for variations in exposure time while still preserving rigorous statistical control over the rate of false positive sources.

3.2 Using Multi-scale Derivatives to Enhance Detection Power

The idea behind Multi-scale Derivatives is to apply a Gaussian filter to the data using a range of bandwidths and examine the derivative of the filtered image with respect to the bandwidth. To get an idea of why this works, we examine the toy example shown in Figure 3.1. Suppose we are at a location with no source. At small bandwidths, our smooth estimate of the null location remains low, and this will stay roughly the same until we reach a bandwidth that is big enough to hit a source, at which point our smoothed estimate will jump up as the source gets smoothed into the null space. The bandwidth at which this jump occurs tells us how close we are to a source. Alternatively, if we are at a source, we will have a high value for our smooth estimate if our bandwidth is small and as we increase the bandwidth, our estimate will get lower and lower as null areas are smoothed into the source. If we compare the derivative of our smoother with respect to the bandwidth, sources will stand out by their large negative derivatives. The smoothing operation can be calculated quickly by performing the convolution in Fourier space, and we can use the same trick to calculate the required derivative of the smoother. The derivative is calculated by convolving the data with a filter of the form

$$F_h(u, v) \propto \left(\frac{d(u, v)^2}{h^3} - \frac{2}{h}\right)\phi(u, v \mid h) \qquad (3.1)$$

where d(u, v) is the distance between the origin and point (u, v) and φ is a symmetric Gaussian kernel with bandwidth h. We then choose a value for h and convolve the filter with the image to get an image which estimates the derivative at scale h for each pixel. We do this for a few values of h which are selected to cover the range of source sizes. The Multi-scale Derivative image, M , is then created at each pixel by selecting the minimum

derivative at that pixel over all the scales,

$$M(u, v) = \min_h F_h(u, v) \qquad (3.2)$$

Figure 3.1: Left: A toy source with a background point and a source point highlighted. A: The intensity of the background point after smoothing with different values of the smoothing parameter h. The intensity stays roughly constant around zero until the source begins to get smoothed into the background, at which point we see a small increase in the intensity. B: The intensity of a point in the source after smoothing with different values of the smoothing parameter h. The intensity drops sharply for small amounts of smoothing and continues to decline for larger values of h. Thus estimates of the derivative of the source point with respect to h should give large negative values whereas the background point will not.

As a result, we should see high negative values at the location of sources in M . The Multi-scale Derivative image M measures the peakedness in the data caused by sources in cases where the background is smoothly varying and will enhance sources in images where there is a flat background. As a demonstration, in Figure 3.2 we show a small dataset due to Sehgal et al. (2007) simulated to look like data from the Atacama Cosmology Telescope (Kosowsky, 2003), a ground-based radio telescope used to detect large galaxy clusters. The clusters we wish to detect are visible as bright spots in the simulated image on the left. In practice, we cannot observe these clusters directly because they are obscured by several other confounding signals. To isolate the clusters, Sehgal et al. use a Wiener filter, which combines data from three different radio frequencies, as seen in the middle image. Our goal is to detect the galaxy clusters from the filtered image. The brightest sources are visible in

Figure 3.2 panels (left to right): True Galaxy Clusters; Noisy Data after Processing; Multi-scale Derivative Image.

Figure 3.2: Left: The galaxy clusters we want to detect. In practice these will be obscured by confounding radio signals. Middle: The Wiener filtered image which combines information from three different frequencies in an attempt to pick out the signal from the galaxy clusters while eliminating the confounding signals. This filter has introduced a non-constant background to the image. Right: The Multi-scale Derivative image calculated from the Wiener filtered image recovers most of the sources we are trying to detect while getting rid of most of the background variation. Data courtesy of Sehgal et al. (2007).

the Wiener filtered image, but the background is spatially varying and is much brighter in certain areas than in others. Any detection algorithm based on the absolute magnitude of pixels will not perform well on this image. Instead we calculate the Multi-scale Derivative image from the filtered image, as seen in the right panel of Figure 3.2. The Multi-scale Derivative image suppresses the background and highlights the sources, which makes source detection much easier. In this example, the Multi-scale Derivative is used as a filter which enhances sources in the image while suppressing the background. Alternatively, it can be viewed as a test statistic which incorporates local information to test for peaks in the data. We can calculate a confidence superset for M (using the Algorithm for Simulating a Confidence Superset for Filtered Data in Section 2.7), which can then be used to create a catalog from M with False Cluster Proportion control; we call this the MSFCP procedure. We expect that the catalog created this way, which is specifically designed to draw out sources, will give us more detection power.
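To make the transform concrete, the following sketch (our own Python; SciPy assumed) implements Equations 3.1 and 3.2 by direct convolution. The bandwidth grid and kernel truncation are illustrative choices, and in practice the dissertation performs these convolutions in Fourier space for speed.

```python
import numpy as np
from scipy.ndimage import convolve

def msd_kernel(h, half_width=None):
    """Kernel proportional to (d^2/h^3 - 2/h) * phi(d | h), the derivative of a
    symmetric Gaussian kernel with respect to its bandwidth h (Equation 3.1)."""
    if half_width is None:
        half_width = int(np.ceil(4 * h))
    x = np.arange(-half_width, half_width + 1)
    xx, yy = np.meshgrid(x, x)
    d2 = xx ** 2 + yy ** 2
    phi = np.exp(-d2 / (2 * h ** 2)) / (2 * np.pi * h ** 2)
    return (d2 / h ** 3 - 2 / h) * phi

def multiscale_derivative(image, bandwidths=(1.0, 2.0, 4.0, 8.0)):
    """Multi-scale Derivative image M (Equation 3.2): the pixelwise minimum of the
    bandwidth-derivative estimates over a grid of scales h."""
    derivs = [convolve(image.astype(float), msd_kernel(h), mode="nearest")
              for h in bandwidths]
    return np.min(derivs, axis=0)
```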

3.3 FCP vs. MSFCP: Simulation Results

To evaluate both FCP and MSFCP, we ran the procedures on images simulated to look like the Chandra data. One hundred 512 × 512 pixel images were made using code developed by Dr. Peter Freeman. The number and intensities of the sources are simulated according to Virani et al. (2006), using exposure times and a point spread function consistent with the CDFS data. Each image was then subjected to four noise conditions. In the first, no additional noise was added, while in the other three, Poisson noise was added to each pixel with means one, two, and five. FCP and MSFCP were then run on each of the four hundred images such that at most 10% of the detections will be false positives with probability .95. For each run, we checked each detection made by our algorithms against the true catalog used to simulate the image. From this information, we calculate the purity of the created catalog as the proportion of detections that correspond to real objects. Table 3.1 shows the mean purity for the two algorithms across the noise conditions.

Mean Purity for Simulated Data

            Purity FCP   Purity MSFCP
Noise 1       0.995        0.984
Noise 2       0.996        0.984
Noise 3       0.997        0.982
Noise 4       0.997        0.977

Table 3.1: Purities averaged over 100 runs for each of FCP and MSFCP. The noise conditions are no additional noise, and Poisson noise added to each pixel with means one, two, and five.

The purities indicate that we are not making many false detections; on average, less than half of one percent of FCP's detections are false, while MSFCP does slightly worse, with between one and a half and two percent of its detections being false. This indicates the algorithms are doing well at not making false detections, but we can also explicitly check whether our coverage guarantee is met. In Table 3.2 we examine the number of times the percentage of false clusters exceeded the pre-specified 10% bound.

           FCP   MSFCP
Noise 1     0      0
Noise 2     0      1
Noise 3     0      1
Noise 4     1      1

Table 3.2: The number of times the percentage of falsely detected clusters exceeded our 10% bound over 100 runs for each of FCP and MSFCP. The noise conditions are no additional noise, and Poisson noise added to each pixel with means one, two, and five.

In the simulations we rarely exceed the pre-specified 10% falsely detected clusters bound – significantly less often than the 5% of runs we are willing to tolerate. This also indicates there may be room for improvement: if we can get closer to the bound, we may be able to get more power. In these simulations, our methods control the rate of false positive detections

as advertised. But for them to be useful they also need to be able to detect the sources we want to detect. Figure 3.3 shows boxplots of the percentage of sources recovered by noise type and by detection procedure. As expected, MSFCP has better detection power across all noise conditions.

Figure 3.3: Boxplots comparing the percent of real clusters detected for FCP and MSFCP across different noise conditions.

Detection power is directly related to the source intensity – brighter sources are easier to detect than faint ones. Thus we can model detection power as a function of source intensity. Figure 3.4 shows logistic regressions for FCP and MSFCP from the highest noise condition data. We see that MSFCP is much better at detecting fainter sources than FCP; for instance, we predict that 95% of sources with source intensity 67.3 will be detected by MSFCP, while FCP would only detect about 27% of these sources.

Figure 3.4: Logistic regressions for the high noise conditions show that MSFCP is better at detecting faint sources than FCP.

3.4 MSFCP on Chandra Data

Using the Multi-scale Derivative filter on the CDFS image should enhance the clusters, making them stand out more from the background. After performing the smooth-then-root transformation as described previously, we compute the Multi-scale Derivative image. We then ran FCP on the Multi-scale Derivative image M , using a confidence superset that was obtained using the algorithm for filtered data outlined in Section 2.7. We refer to this procedure as the MSFCP procedure. Using MSFCP, with the same parameters as before, we detect 24 of the 26 sources detected in the main catalog. We do not make any detections outside of the catalog. Since the detections in the catalog have been replicated in follow-up observations this suggests that we are not making any false detections while still detecting almost all of the real objects.

Figure 3.5: CDFS data after applying the MSFCP procedure. Black indicates sources detected by FCP and the circles are detections in both the Main Catalog and the Secondary Catalogs from GI.

Without having to simulate images made to look like the telescope images, without making any assumptions about the underlying astronomy, and without conducting follow-up observations, we can still make the statement that with probability .95 less than 10% of our detections will be false. The other algorithms cannot make such an explicit guarantee of a pure sample. The detection strategy used in Giacconi et al. (2002), which we refer to as GI, is designed for a small patch of sky with the ability to follow up all potential detections. In this scenario, it makes sense to use a detection scheme that casts a wide net. Because we can go back and verify or reject each potential detection, we will tolerate a larger number of spurious sources initially in exchange for the ability to detect some fainter sources. This wide net strategy is unfair to our MSFCP approach, which was run to keep the false detection rate less than 10% without the benefit of additional data. To mimic this wide net approach we can relax our parameters. If we allow for the proportion of false sources to grow to 20% we detect 26 sources, two of which are new detections outside of the main catalog. Both these new sources have been subsequently detected using other datasets (Luo et al., 2008), so we are confident that they are both real.

Figure 3.6: Left: Raw photon counts from the telescope. Center: Black regions are the three sources detected by MSFCP using a tolerance of 20%. The detections in the middle and upper left are also detected and verified in GI. Right: The optical counterpart observation for the source in the center from Giacconi et al. (2002). The contour lines indicate X-ray activity at 3, 5, 10, and 20 σ above the local background. We can clearly see a source that overlaps with our detection in the lower left, suggesting it is not a spurious detection.

In Figure 3.6, we show the original X-ray photon counts for a small region of sky containing one of the newly detected sources. Visually we see three candidates, and using a

False Cluster Proportion tolerance of 20% we detect all three. In the final panel of Figure 3.6 we see the optical follow-up, which clearly shows an object at the location of our new detection, further confirming that this new detection is in fact a real source that was missed by the analysis in GI. Not only has our procedure captured almost all of the detections from the previous analysis with rigorous error control, it has also found two new sources that were missed by the astronomers.

Comparison of Detection Procedures on CDFS

Method   False Cluster Proportion   % of Main Catalog   # of New Verified
         Tolerance (c)              in GI recovered     Detections outside of GI
PP       10%                        65.3%               0
MSFCP    10%                        92.3%               0
MSFCP    20%                        92.3%               2

Table 3.3: Summary of Detection Routines on CDFS

We expect the proportion of false sources one is willing to tolerate to be a parameter that astronomers will have to determine to suit the problem at hand. For instance, the "wide-net" approach works adequately for small data sets, but newer telescopes, like LSST, will collect such a large amount of data that following up each potential detection will be expensive, if it is possible at all. In this scenario, having a detection procedure that controls the proportion of false detections automatically in one pass will be vitally important.

3.5 Accounting for Varying Exposure Times

Many astronomical images are actually composites of several different images which are taken at different times and may be taken at slightly different angles or orientations. This leads to certain pixels having more exposure time than others. This requires us to modify our basic model from Equation 2.1 to

$$Y_{ij} \sim \mathrm{Poisson}\big(w_{ij}(\lambda_{1,ij} + \lambda_{0,ij})\big) \qquad (3.3)$$

where wij denotes the exposure time for pixel ij normalized to [0, 1]. When we smooth the data we need to account for the varying exposure times. This is especially important if parts of the image have zero exposure time. In that case, we call pixels with zero exposure

time to be structural zero pixels, because unlike the other pixels they are not random. We do not want to smooth structural zero pixels in with pixels that have been exposed. To ensure that this does not happen we include the exposure times in our smoothing operation. For notational convenience we reindex the pixels such that each pixel only has one index instead of the previously used two which referred to the row and column positions.

$$\tilde{Y}_i = \frac{\sum_{j=1}^{n} w_j Y_j K\!\left(\frac{i-j}{h}\right)}{\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right)} \qquad (3.4)$$

We call $\tilde{Y}_i$ the exposure-time weighted smooth version of pixel $Y_i$, as it smooths the data while respecting the exposure times. This also leads to a new version of the Multi-scale derivative transform, which takes the exposure time into account. This is necessary not only to ensure that structural zero pixels do not cause problems, but also because sharp changes in exposure time can create artificial gradients in the data which could get picked up and mistaken for sources by peak-finding algorithms like the Multi-scale derivative. The Exposure-time corrected Multi-scale derivative (ETMSD) is constructed by taking the derivative, with respect to the scale h, of the exposure-time weighted smooth image built with a Gaussian kernel:

$$\frac{d}{dh}\sum_{j=1}^{n} w_j Y_j K\!\left(\frac{i-j}{h}\right) = \sum_{j=1}^{n} w_j Y_j \left(\frac{-2}{h} + \frac{\|i-j\|^2}{h^3}\right) K\!\left(\frac{i-j}{h}\right)$$

$$\frac{d}{dh}\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right) = \sum_{j=1}^{n} w_j \left(\frac{-2}{h} + \frac{\|i-j\|^2}{h^3}\right) K\!\left(\frac{i-j}{h}\right)$$

Letting $A_{i-j,h} = \left(\frac{-2}{h} + \frac{\|i-j\|^2}{h^3}\right)$ and applying the quotient rule, we get

$$F_h(i) = \frac{d}{dh}\tilde{Y}_i = \frac{\sum_{j=1}^{n} w_j Y_j A_{i-j,h} K\!\left(\frac{i-j}{h}\right)\,\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right) - \sum_{j=1}^{n} w_j Y_j K\!\left(\frac{i-j}{h}\right)\,\sum_{j=1}^{n} w_j A_{i-j,h} K\!\left(\frac{i-j}{h}\right)}{\left(\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right)\right)^2}$$

This tells us how to estimate the derivative with respect to the scale h for the exposure-time weighted smooth image. To simplify computations we break $F_h(i)$ into four blocks, each of which can be computed quickly using Fourier transformations:

$$F_h(i) = \frac{\overbrace{\textstyle\sum_{j=1}^{n} w_j Y_j A_{i-j,h} K\!\left(\frac{i-j}{h}\right)}^{B_1}\;\overbrace{\textstyle\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right)}^{B_2} - \overbrace{\textstyle\sum_{j=1}^{n} w_j Y_j K\!\left(\frac{i-j}{h}\right)}^{B_3}\;\overbrace{\textstyle\sum_{j=1}^{n} w_j A_{i-j,h} K\!\left(\frac{i-j}{h}\right)}^{B_4}}{\Big(\underbrace{\textstyle\sum_{j=1}^{n} w_j K\!\left(\frac{i-j}{h}\right)}_{B_2}\Big)^2} \qquad (3.5)$$

Let $f(\cdot)$, $f^{-1}(\cdot)$ denote the Fourier transform and its inverse respectively, while $\times$ denotes point-wise multiplication. Additionally, let $A_h$ and $K_h$ be the appropriately centered matrix versions of the $A_{i-j,h}$ and the Gaussian kernel at scale h, with $w$ and $Y$ the matrices containing the exposure times and photon counts respectively. Then

$$B_1(h) = f^{-1}\big(f(w \times Y) \times f(A_h \times K_h)\big) \qquad (3.6)$$
$$B_2(h) = f^{-1}\big(f(w) \times f(K_h)\big) \qquad (3.7)$$
$$B_3(h) = f^{-1}\big(f(w \times Y) \times f(K_h)\big) \qquad (3.8)$$
$$B_4(h) = f^{-1}\big(f(w) \times f(A_h \times K_h)\big) \qquad (3.9)$$

so,

$$F_h(i) = \begin{cases} \dfrac{B_1(h) \times B_2(h) - B_3(h) \times B_4(h)}{B_2(h)^2} & : B_2(h)_i > 0 \\[6pt] 0 & : B_2(h)_i = 0 \end{cases} \qquad (3.11)$$

Equation 3.11 should be significantly faster to compute using Fast Fourier transforms (e.g. Frigo & Johnson, 2005). The estimate is set to zero when the denominator, which is a smooth estimate of the exposure time, is zero. Then, as with the Multi-scale derivative transform, the ETMSD image, E, is calculated by taking the pixelwise minimum of the derivative estimates over all scales:

$$E(i) = \min_h F_h(i) \qquad (3.12)$$

The ETMSD image should have large negative values at the locations where there are sharp peaks in the data but will suppress peaks that are artificially created by sharp changes in exposure times.
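A minimal sketch of the ETMSD computation via the FFT blocks of Equations 3.6–3.11 (our own Python). The bandwidth grid and the unnormalized Gaussian kernel are illustrative choices, and the circular (toroidal) boundary effects noted in the Figure 3.9 caption are not trimmed here.

```python
import numpy as np

def etmsd(counts, exposure, bandwidths=(1.0, 2.0, 4.0, 8.0)):
    """Exposure-time corrected Multi-scale Derivative (Equations 3.5-3.12),
    computing the four blocks B1-B4 by FFT convolution."""
    n0, n1 = counts.shape
    wY = exposure * counts
    fw, fwY = np.fft.fft2(exposure), np.fft.fft2(wY)

    # Kernel coordinates wrapped so the kernel origin sits at pixel (0, 0)
    y = np.fft.fftfreq(n0) * n0
    x = np.fft.fftfreq(n1) * n1
    d2 = y[:, None] ** 2 + x[None, :] ** 2          # squared distance ||i - j||^2

    E = np.full(counts.shape, np.inf)
    for h in bandwidths:
        K = np.exp(-d2 / (2 * h ** 2))              # Gaussian kernel at scale h
        A = -2.0 / h + d2 / h ** 3                  # A_{i-j,h}
        fK, fAK = np.fft.fft2(K), np.fft.fft2(A * K)
        B1 = np.real(np.fft.ifft2(fwY * fAK))       # Eq 3.6
        B2 = np.real(np.fft.ifft2(fw * fK))         # Eq 3.7
        B3 = np.real(np.fft.ifft2(fwY * fK))        # Eq 3.8
        B4 = np.real(np.fft.ifft2(fw * fAK))        # Eq 3.9
        safe = np.where(B2 > 0, B2, 1.0)            # avoid division by zero
        Fh = np.where(B2 > 0, (B1 * B2 - B3 * B4) / safe ** 2, 0.0)   # Eq 3.11
        E = np.minimum(E, Fh)                        # Eq 3.12: pixelwise minimum over scales
    return E
```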

3.6 Chandra 2Ms Exposure Time Application

The Chandra Deep-field South 2Ms dataset (Luo et al., 2008) doubles the exposure time of the 1Ms dataset we previously examined. For this example we look at a 1024×1024 pixel slice of the data (Figure 3.7), which has significant variation in exposure time, as we can see in the exposure map (Figure 3.8).

Figure 3.7: Raw photon counts for a 1024 pixel by 1024 pixel patch of the Chandra Deep Field South 2Ms dataset.


Figure 3.8: Exposure times (normalized to [0,1]) for a 1024 pixel by 1024 pixel patch of the Chandra Deep Field South 2Ms dataset.


A significant section of this dataset has zero exposure time meaning that those regions are structural zero pixels. Additionally, there is an exposure time gradient in the image that goes from zero exposure time up to the maximum exposure time. Thus we model each pixel as

$$Y_{ij} \sim \mathrm{Poisson}\big(w_{ij}(\lambda_{1,ij} + \lambda_0)\big) \qquad (3.13)$$

where wij is the exposure time normalized to [0, 1] for each pixel, λ0 is the mean background level and λ1,ij is the contribution, if any, from sources. Since we know the exposure time for each pixel, we can divide each pixel with non-zero exposure by its exposure time so that for the non structural zero pixels

$$E\!\left[\frac{Y_{ij}}{w_{ij}}\right] = \lambda_{1,ij} + \lambda_0 \qquad (3.14)$$

we call this operation flat-fielding the image since the expected value for each exposed background pixel is now λ0 . The flat-fielding operation also changes the variance, as well as the mean, so the variance for each pixel is now

$$V\!\left[\frac{Y_{ij}}{w_{ij}}\right] = \frac{1}{w_{ij}}(\lambda_{1,ij} + \lambda_0) \qquad (3.15)$$

This can be problematic, as pixels with low exposure times will have high variances. In practice, we usually zero out any pixel with exposure time less than 0.1 and treat it as a structural zero pixel. We also smooth the exposure maps just as we would smooth the data. After flat-fielding the image we can apply MSFCP using Equation (3.13) and the known exposure map to simulate a confidence superset as follows:

Algorithm: MSFCP Confidence Superset with Exposure Times

1. For each b ∈ {1, . . . , B}, simulate an image, Y^b, with the same dimensions as the data according to Equation (3.13), setting λ1,ij = 0 and using the exposure times from the exposure map
2. Perform a weighted smooth (Equation (3.4)) of each Y^b with the weights corresponding to the exposure times
3. Flat-field the smoothed Y^b, dividing it by the smoothed exposure time map
4. Calculate the Multi-scale derivative of the smoothed and flat-fielded Y^b and record the maximum
5. Calculate r1−α, the (1 − α) percentile of the maxima
6. Using the flat-fielded Multi-scale Derivative image from the data, M, calculate the level set U = {Mij : Mij ≤ r1−α}

Following the above algorithm we get the 95% confidence superset shown in Figure 3.9. We see that there are large patches, which correspond to the edge of the exposure map, that are not included in the confidence superset but are not sources. This indicates that our confidence superset is not valid, since it should contain all non-source pixels with probability 0.95. This occurs because the Multi-scale derivative is looking for sharp peaks in the data, which we are implicitly assuming only occur due to sources. However, sharp changes in exposure times can also cause peaks in the data, which the Multi-scale derivative detects and confuses with sources. The solution is to include the exposure time in the calculation using the Exposure-time corrected Multi-scale Derivative (ETMSD) image (Equation (3.12)), as shown in Figure 3.10. In the ETMSD image, we can still clearly distinguish the sources from the background, but the edges of the exposure map are not picked up as being "peaked" in the way they were when we did not account for the exposure times. We calculate the confidence superset following the MSFCP with Exposure Times Algorithm outlined earlier but replacing the Multi-scale Derivative transform with the ETMSD transform. The ETMSD 95% confidence superset is shown in Figure 3.11. The confidence superset correctly handles the variations in exposure times and produces a valid


Figure 3.9: 95% confidence superset for 2Ms data using the MSFCP with Exposure Times algorithm. White areas are the superset and black areas are regions outside the superset. We see a string of black spots that run along the edge where the exposure time has a sharp increase from zero. These are not sources but rather an artifact that the Multi-scale derivative is picking up from the change in exposure time! We also see some black around the edges which is an artifact from the toroidal boundary conditions in the Fast Fourier algorithm and should be trimmed out of the image.


Figure 3.10: ETMSD image of 2Ms data. We see that the sources show up clearly but we are not picking up artifacts due to the exposure times. Although you can see the edge of the exposure map in the image, the magnitudes of those pixels are in line with, or smaller than, the other background pixels, as opposed to the Multi-scale derivative images where those edges are bright enough to be confused with sources.

confidence superset. However, it includes all but one of the sources in the superset, making it very conservative. This means we will have very poor power. The confidence superset is conservative because, as can be seen in Equation (3.15), pixels with low exposure times will have high variance due to the flat-fielding. These highly variable pixels are the ones that generate the largest values, which determine the confidence superset. This is demonstrated in Figure 3.12, where we see that for simulated data with no sources the maximum of the ETMSD image occurs on the edges of the exposure map. One possible solution to this problem is to bin the exposure times, calculate maxima for each bin, and combine that information to create a confidence superset.

3.7 A Procedure for Exposure-time Binned Confidence Supersets

The confidence supersets in Section 2.7 use simulations to find the value $r_{1-\alpha}$ such that $P\{\max_{i \in S_0} Y_i \le r_{1-\alpha}\} = 1 - \alpha$. It follows that

$$1 - \alpha = P\left\{\max_{i \in S_0} Y_i \le r_{1-\alpha}\right\} \qquad (3.16)$$
$$= P\left\{Y_i \le r_{1-\alpha}\ \forall\, Y_i \in S_0\right\} \qquad (3.17)$$
$$\ge P\left\{U = \{Y_i : Y_i \le r_{1-\alpha}\} \supseteq S_0\right\} \qquad (3.18)$$

Now suppose we split our data into bins according to exposure time. Then we can follow the same procedure for each bin to get a confidence superset for that bin. We will show how to combine these binned supersets to get a confidence superset for the entire image. We denote bin i by $b_i$ for $i = 1, \ldots, N_b$ and assume the bins are chosen in such a way that every pixel belongs to exactly one bin, with $|b_i|$ the number of pixels in $b_i$. Define

$$S_0^i = \{Y_j \in b_i : Y_j \in S_0\} \qquad (3.19)$$

the set of null pixels in $b_i$, and assume we know the value $r_{1-\alpha_i}$ such that

$$P\left\{\max_{Y_j \in S_0^i} Y_j \le r_{1-\alpha_i}\right\} = 1 - \alpha_i \qquad (3.20)$$

Then, following the same argument as above, we can show the set

$$U^i = \{Y_j \in b_i : Y_j \le r_{1-\alpha_i}\} \qquad (3.21)$$

is a $1 - \alpha_i$ confidence superset for $S_0^i$:

$$1 - \alpha_i = P\left\{\max_{Y_j \in S_0^i} Y_j \le r_{1-\alpha_i}\right\} \qquad (3.22)$$
$$= P\left\{Y_j \le r_{1-\alpha_i}\ \forall\, Y_j \in S_0^i\right\} \qquad (3.23)$$
$$\ge P\left\{U^i \supseteq S_0^i\right\} \qquad (3.24)$$

Figure 3.11: 95% confidence superset for the ETMSD image of the 2Ms data in white. We see one source (circled) that is not in the superset in the upper right hand corner. All the other sources are absorbed in the confidence superset. We have, however, fixed the problem of changes in exposure time showing up outside of the confidence superset. So this is a valid superset, albeit a conservative one.

Figure 3.12: The underlying black and white image is the exposure map. Over it we have plotted the location of the maxima from 100 simulated ETMSD images, where the data were simulated using the exposure map and no sources. We see that the maximum always occurs where the exposure time is small, which corresponds to pixels with high variances in the flat-fielded image.

To construct a confidence superset for the whole image we need $P\{U \supseteq S_0\} \ge 1 - \alpha$. Using the Bonferroni inequality we can show that for appropriately chosen $\alpha_i$, $U = \bigcup_{i=1}^{N_b} U^i$ is a $1 - \alpha$ confidence superset for $S_0$:

$$P(U \supseteq S_0) = 1 - P(U \not\supseteq S_0) \qquad (3.25)$$
$$= 1 - P\Big(\bigcup_{i=1}^{N_b} \{U^i \not\supseteq S_0^i\}\Big) \qquad (3.26)$$
$$\ge 1 - \sum_{i=1}^{N_b} P(U^i \not\supseteq S_0^i) \qquad (3.27)$$
$$= 1 - \sum_{i=1}^{N_b} \alpha_i \qquad (3.28)$$

So for any choice of non-negative $\alpha_i$ such that $\sum_{i=1}^{N_b} \alpha_i = \alpha$ we get a $1 - \alpha$ superset for $S_0$ by taking the union of $1 - \alpha_i$ confidence supersets for $S_0^i$. A reasonable choice is to weight each bin by its cumulative exposure time, so that

$$\alpha_i = \frac{\alpha \sum_{j \in b_i} w_j}{\sum_{k=1}^{N_b} \sum_{j \in b_k} w_j} \qquad (3.29)$$

Another choice for $\alpha_i$ is

$$\alpha_i = \frac{|b_i|\,\alpha}{\sum_{j=1}^{N_b} |b_j|} \qquad (3.30)$$

which weights the importance of each bin by the number of pixels in the bin. The binning allows us to treat regions of the image separately based on exposure time, specifically we can split apart pixels with low exposures from high exposures and get separate confidence supersets for each which we can then combine to get the confidence superset for the entire image.
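As an illustration, a sketch of the binned construction; `simulate_etmsd_max` is a hypothetical helper that wraps the MSFCP-with-exposure-times simulation restricted to one bin, and the bin edges and α_i are supplied by the user.

```python
import numpy as np

def binned_confidence_superset(E, exposure, bin_edges, alphas, simulate_etmsd_max, B=1000):
    """Union of per-bin 1 - alpha_i confidence supersets (Equations 3.25-3.28).
    E is the ETMSD image of the data and exposure the normalized exposure map."""
    assert len(alphas) == len(bin_edges) - 1
    U = np.zeros(E.shape, dtype=bool)
    for i, alpha_i in enumerate(alphas):
        mask = (exposure > bin_edges[i]) & (exposure <= bin_edges[i + 1])
        maxima = simulate_etmsd_max(mask, B)     # B simulated null maxima of E over this bin
        r = np.quantile(maxima, 1 - alpha_i)     # r_{1 - alpha_i}
        U |= mask & (E <= r)                      # U^i = pixels in bin i with E <= r
    return U
```

With the choices of Section 3.7.1, for example, one might use bin edges (0.1, 0.7, 0.9, 1.0) and α_i = (.005, .015, .03).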

3.7.1

Exposure-time Binned Confidence Superset Example

Without binning the exposure times we get the 95% confidence superset U = {Ej : Ej < 2453}, shown in Figure 3.11, where E is the Exposure-time corrected

3.8. CONCLUSIONS

47

Multi-scale derivative image. This confidence superset is valid but conservative. Using the binned confidence superset approach outlined in the previous section produces a better confidence superset as shown in Figure 3.13. We used three bins, one for exposure times less that 0.7, one for exposure times between 0.7 and 0.9 and one for exposure times greater than 0.9; the corresponding αi were set to .005, .015, and .03. The selection of the bins was somewhat arbitrary. We want to keep the number of bins low because as we increase the number of bins there are more αi but they still have to sum to α, meaning the binned confidence supersets become more conservative. On the other hand we want to use enough bins to best capture the variations. Further work is needed to determine the best way to select the bins. Figure 3.13 shows 5 sources that were strong enough to not be included in the confidence superset compared to one when we did not use the binning approach. The same five clusters are recovered by running FCP on the ETMSD image with the parameters set to bound the rate of false positive detections to be less than 10% with probability .95 and using the binned confidence superset.

3.8

Conclusions

We have extended the False Cluster Proportion Multiple Testing Procedure so it can now be applied to a wide variety of problems in Astronomy, as well as other fields. By moving away from theoretical approximations, like Piterbarg’s approximation, we can now calculate confidence supersets for a large variety of noise conditions. This allows us to create source catalogs with a probabilistic guarantee that they are not overly polluted with false detections without having to make any difficult to check assumptions about the data or the science behind the data. The Multi-scale Derivative procedure enhances the types of sources we typically see in a telescope image and we have shown that they can also enhance the power of False Cluster Proportion source detection algorithms. This is evidenced by simulation results as well as our analysis of the CDFS data. The catalog published in GI uses two different algorithms and then has to follow up each detection, whereas we run MSFCP and can make the statement that with probability .95 less than 10% of our detections are false, and still get virtually the same catalog. When we allow 20% of our detections to be false we make two new detections, both of which we have verified are real sources that were missed

CHAPTER 3. ENHANCING FALSE CLUSTER PROPORTION METHODS 48 FOR ASTRONOMICAL SOURCE DETECTION

Figure 3.13: 95% confidence superset for ETMSD image of 2Ms data in white. This confidence superset was created using three bins for exposure times. We see five sources that are not in the superset in the upper right hand corner compared to the one source we saw without binning.

3.8. CONCLUSIONS

49

in the original analysis. Our procedure provides rigorous error control, which is lacking in the current techniques used by astronomers. At the same time we have demonstrated that the detection power is comparable and in fact we have detected sources that they have missed. Controlling false detections without the need for expensive follow up observations will be critical as the next generation of telescopes will provide a deluge of data that will be impossible to process manually. We have also developed methods for images with varying exposure times. If certain pixels are less exposed then others it changes what the expected counts would be for that pixel and our method takes this into account. Furthermore, when the exposure time is not constant, if we do not adjust appropriately we may mistake artifacts due to the exposure time as sources. Unfortunately, this makes the testing problem more complex and as a result we lose some power in our techniques. This has motivated the binning approach in which we try to maximize our power for the well exposed regions while insulating the methods from being dominated by the low exposure regions. The methods in the past two chapters were developed for the astronomical source detection problem. However, they are not so narrow as to be only applicable to telescope images. In the next chapter we move from images that cover the night sky to images of the neural activity and show that the same statistical principles and methods developed to detect galaxy clusters and quasars can also be used to detect subtle activity in the human brain.

Chapter 4

False Cluster Proportion Methods for Functional Neuroimaging Data

4.1 Introduction

In a functional Magnetic Resonance Imaging (fMRI) experiment, a participant is placed in the Magnetic Resonance scanner and asked to perform a carefully arranged sequence of behavioral tasks while a time series of three-dimensional brain images are acquired at regular intervals. Concentrated neural activity produced by performing the tasks induces detectable but subtle changes in the images due to a blood-flow response in the brain which can be non-invasively measured by the scanner. A common form of analysis for fMRI data considers the time series of measurements at each location in the image and computes a statistic for that location that is designed to detect task-related changes in the signal. “Hot spots” in images of these statistics roughly correspond to areas of activity in the brain. See Genovese (2000) for more detail on this process. In the series of experiments examined in this chapter, the goal is to verify that a new method for acquiring high spatial resolution data can be used to detect previously undetectable fine-scale changes in brain activity without being overwhelmed by noise. Towards this goal, the study focuses on V1, the primary area responsible for visual processing in the brain. V1 was chosen because it is a well understood region of the brain and can be easily manipulated by presenting visual stimuli to the participant. In this series of experiments, the participant is asked to visually fixate on a spot at the center of the visual field while two complementary sets of annular rings flash alternately at a fixed frequency (Figure 4.1). The participants’ brains are then imaged using a new technique that provides much higher spatial resolution than is typical for fMRI studies, without sacrificing temporal resolution. Since these images are recorded in three dimensions we refer to the units as voxels, which 51


can be thought of as three dimensional pixels. The subset of voxels that intersect the cortical surface are then projected down to the two-dimensional surface of the brain for analysis. For this experiment, the three dimensional images are 256 × 64 × 28 voxels and 28,830 of those voxels are projected onto the cortical surface. Whereas typical fMRI images have a resolution of about 3mm, using this new technique our data have a resolution of 0.75mm. The higher resolution allows us to see finer-scale detail in brain activity. In the three experiments presented in this chapter, we are trying to detect bands of activity that are 5mm, 2.5mm and 1.6mm wide. With a resolution of 3mm, we could hopefully detect the 5mm band but the 2.5 and 1.6mm bands would not be resolvable because the response from a voxel element containing the band would essentially be the average over the band and the other signals in that voxel. One concern in using the higher-resolution data is that noise will obscure the activity we are attempting to observe. In this chapter we develop a technique to identify regions of activity in the high-resolution data with rigorous bounds on the rate of false positive regions caused by noise. We show that we can in fact detect the 2.5mm bands and even detect some structure for the 1.66mm bands. In V1, voxels that respond to the stimulus should have an fMRI signal that varies like a sinusoid, as the corresponding location in the visual field turns on and off. Furthermore, locations that respond to the first set of rings should have sinusoidal signals 180◦ out of phase to locations that respond to the second set of rings, so that the active regions are further split into two distinct phase classes, which corredspond to the two different stimuli rings. Our goal is to identify the regions in V1 that activate in response to each set of stimuli. This will appear as a series of bands across the visual cortex with alternating bands out of phase with each other. We want to identify coherent regions of neural activity so our criterion for differentiating regions should be based upon spatially localized regions, not the individual voxels themselves (e.g. Heller et al., 2006). To this end, we adapt a procedure that controls the rate of false positive spatial regions which has been very successful in identifying astronomical objects in telescope images. In the Methods section we first detail a classification procedure for classifying voxels by phase. We need to know which stimulus, if any, each voxel is responding to and this information is encoded in the phase. After classifying the voxels, we need some criteria to declare which regions are active. For this purpose, we introduce a modified version of the False Cluster Proportion algorithm. The modified version handles points in space that are


not necessarily on a regular grid and also can account for different classes of clusters – like the two classes of voxels corresponding to the two different stimuli. To use the modified False Cluster Proportion algorithm requires the construction of a confidence superset, and we present two methods for constructing such a set. Finally, we present results for three different datasets in which the bands become increasingly smaller and more difficult to detect.

Figure 4.1: Graphical summary of the fMRI experiment. A: Checkerboard stimulus presented to the subject; the rings alternate at a rate of 1/24 seconds and are calibrated using previous experiments to produce bands of cortical activation of a pre-specified width. B: Sample response to the stimuli overlaid on an anatomical scan of the brain. C: Data are projected onto a two-dimensional representation of the cortical surface. D: Fourier representation of a sample time series. The large peak corresponds to the stimulus frequency and can be detected using an F-test.

4.2 Methods

To detect the periodicity at each of the n voxels, we use a variant of Fisher’s F-test (Brockwell & Davis, 1998) for each time series to determine if that location is responsive to the stimulus. For each location, we also have the response phase in radians which, for locations that respond to the stimulus, we expect to cluster corresponding to the differing sets of rings. For voxel Vi we denote the p-value from the F-test as pi ∈ [0, 1] and the phase as zi ∈ [0, 2π). Our task is to separate regions of voxels into three groups: null locations for which there is no response to the stimulus, locations that respond to the first set of rings and those that respond to the second set of rings. To do this we first classify each voxel into one of two groups based on phase. Next, we declare regions of voxels of the same phase to be active according to a criterion that controls the rate of false positive regions.


4.2.1 Classification

We first classify the voxels into two groups according to the two phase classes. The classifier is initialized using a small subset of voxels with the smallest p-values (in this case we used 256 voxels). The small p-values indicate a strong response to a stimulus and should tightly cluster into the two phase groups corresponding to the two stimuli. A 2-means clustering algorithm is then applied to the subset, which separates out the two groups and estimates the mean phase in radians for each group, which we denote G1 and G2. Additionally, the number of voxels in Groups 1 and 2 are denoted n1 and n2 respectively. The classifier is derived based on the diameter drawn at the angle (n1 G1 + n2 (G2 + π))/(n1 + n2) − π/2, such that points which lie on one side of the diameter are all assigned to the same class and those on the opposite side of the diameter are all assigned to the other class. This classifier assumes that the centers of the two phase classes are exactly 180° (π radians) apart, so that G2 + π is an estimate of G1. The classifier is then drawn perpendicular to our best estimate of G1. An example is shown in Figure 4.2.
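One simple way to implement this classifier is sketched below (our own choices: scikit-learn's k-means on the unit-circle embedding (cos z, sin z) of the phases rather than on the raw radians, to avoid wrap-around; the weighted angle average follows the formula above and assumes the two group means do not straddle the branch cut).

```python
import numpy as np
from sklearn.cluster import KMeans

def phase_classifier(pvals, phases, n_init_voxels=256):
    """Cluster the phases of the strongest responders into two groups, compute the
    boundary angle, and assign every voxel to a phase class by which side of the
    boundary diameter its phase falls on."""
    idx = np.argsort(pvals)[:n_init_voxels]              # voxels with the smallest p-values
    z = phases[idx]
    km = KMeans(n_clusters=2, n_init=10).fit(np.column_stack([np.cos(z), np.sin(z)]))
    G1, G2 = [np.arctan2(c[1], c[0]) % (2 * np.pi) for c in km.cluster_centers_]
    n1, n2 = np.bincount(km.labels_, minlength=2)
    boundary = (n1 * G1 + n2 * (G2 + np.pi)) / (n1 + n2) - np.pi / 2
    theta_hat = boundary + np.pi / 2                      # estimated G1 direction
    labels = (np.cos(phases - theta_hat) > 0).astype(int) # 1 = G1 side, 0 = G2 side
    return boundary, labels
```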

4.2.2 False Cluster Proportion Algorithm for fMRI

False Cluster Proportion algorithms (Perone Pacifico et al. (2004), Friedenberg & Genovese (2009)) are procedures for detecting clustered structure in data with rigorous statistical bounds on the rate of falsely detected clusters. These procedures examine candidate regions and compare them to a confidence superset in such a way that we can accurately recover the clusters while controlling the rate of false positives. In this section we introduce a modified False Cluster Proportion Algorithm specifically designed for this type of fMRI problem. We treat our data as a realization of a Random Field X = {X(s) : s ∈ S} on a set S with a non-negative mean function µ(s). µ(s) will only be non-zero if there is activity at that location. We want to differentiate the regions S0 = {s ∈ S : µ(s) = 0} and S1 = {s ∈ S : µ(s) > 0}. On the random field we define a 1 − α confidence superset, which we denote U , to be a set that satisfies: P (U ⊇ S0 ) ≥ 1 − α

(4.1)

Let Lt = {Vi : pi ≤ t}, that is Lt is the set of all voxels whose p-value is less than the threshold t. Lt can be decomposed into its connected components (clusters). A connected component at the level t is a maximal subset of voxels belonging to Lt that are all of the same

phase class and are spatially coherent. For the purposes of this analysis, a set of voxels is considered spatially coherent if for any two voxels in the subset there exists a path from one to the other using only voxels which are also in the subset and in which the euclidean distance between any two voxels in the path is less than the application-dependent parameter d. We can decompose a level set Lt into the set of connected components Ct = {C1,t, C2,t, . . . , Ckt,t}, which are the kt clusters at threshold level t. An individual cluster C is declared false if

$$\frac{\#\{V_i \in (C \cap S_0)\}}{\#\{V_i \in C\}} \ge \epsilon \qquad (4.2)$$

where ε is a tolerance parameter chosen by the user. Our goal is to find a threshold t whose associated clusters, Ct, are not overly polluted with false clusters. Mathematically, we want to control the true false cluster proportion, Ξ(t):

$$\Xi(t) = \frac{\#\left\{1 \le j \le k_t : \frac{\#\{V_i \in (C_j \cap S_0)\}}{\#\{V_i \in C_j\}} \ge \epsilon\right\}}{k_t} \qquad (4.3)$$

Figure 4.2: The phases of the 256 voxels with the smallest p-values are plotted on the unit circle. The two large circles denote the means of the two groups after running the 2-means clustering algorithm. The blue diameter is drawn perpendicular to the diameter that most closely connects the two group means, calculated at the angle (n1 G1 + n2 (G2 + π))/(n1 + n2) − π/2. This line is the classification boundary for the phase classes.


The true false cluster proportion can be bounded by the false cluster proportion envelope, $\bar{\Xi}(t)$:

$$\bar{\Xi}(t) = \frac{\#\left\{1 \le j \le k_t : \frac{\#\{V_i \in (C_j \cap U)\}}{\#\{V_i \in C_j\}} \ge \epsilon\right\}}{k_t} \qquad (4.4)$$

where U satisfies Equation 4.1. Then, from Perone Pacifico et al. (2004), we get

$$P\left(\Xi(t) \le \bar{\Xi}(t)\ \ \forall t\right) \ge 1 - \alpha \qquad (4.5)$$

Equation 4.5 tells us that we can recover clusters with a probabilistic bound on the rate of false clusters. To ensure the rate of false clusters is less than γ with probability 1 − α, one needs to find the value tγ such that Ξ̄(tγ) = γ. This motivates the following algorithm to recover clusters of activity with the probabilistic guarantee that less than γ · 100% of the clusters are false with probability 1 − α:

Algorithm: FCP for fMRI

1. Calculate a 1 − α confidence superset U for the data and set tl = 0 and tu = 1
2. Choose a t ∈ (tl, tu)
3. Calculate Ξ̄(t) using U from Step 1
4. If Ξ̄(t) > γ set tl = t. Else set tu = t
5. Iterate Steps 2–4 until t converges to tγ
6. The set of ktγ clusters will have the property that the true proportion of false clusters will be less than γ with probability 1 − α

The result is a simple threshold t that gives a powerful and tunable bound on the rate of falsely detected regions. At this time, the algorithm and probabilistic guarantee do not cover voxels that are misclassified by the classification algorithm. However the voxels which will be declared active will all have small p-values which correspond to strong stimulus responses and should therefore be easy to classify.
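A sketch of the envelope computation and threshold search (our own Python). Here `connected_components` is a hypothetical helper that groups voxel indices into spatially coherent clusters using the distance parameter d, `in_U` is a Boolean array marking voxels inside the confidence superset, and we use a simple grid search over t rather than the iterative search described above.

```python
import numpy as np

def false_cluster_envelope(t, pvals, in_U, connected_components, eps=0.9):
    """Xi-bar(t) of Equation 4.4: the fraction of clusters at threshold t whose
    overlap with the confidence superset U is at least eps."""
    clusters = connected_components(np.where(pvals <= t)[0])   # list of voxel-index arrays
    if len(clusters) == 0:
        return 0.0
    suspect = sum(in_U[c].mean() >= eps for c in clusters)
    return suspect / len(clusters)

def fcp_threshold(pvals, in_U, connected_components, gamma=0.1, eps=0.9, grid=200):
    """Return the largest threshold t whose false cluster proportion envelope
    does not exceed gamma."""
    best = 0.0
    for t in np.linspace(0.0, 1.0, grid):
        if false_cluster_envelope(t, pvals, in_U, connected_components, eps) <= gamma:
            best = max(best, t)
    return best
```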


4.2.3 Calculating U

We call U a 1 − α confidence superset for S0 if it has the following property: P (U ⊇ S0 ) ≥ 1 − α

(4.6)

A simple way to calculate a confidence superset is noting that under the null hypothesis the p-values will be distributed uniformly. Thus, if we have n voxels, we can directly calculate what we would expect to see under the null. Explicitly, the set

$$U = \{V_i : p_i > 1 - (1 - \alpha)^{1/n}\} \qquad (4.7)$$

will satisfy Equation 4.1 although it will be conservative in the sense that the set will contain more voxels than necessary. The conservativeness will lessen the power of the False Cluster Proportion Algorithm but makes the superset trivial to calculate. Alternatively, U can be calculated via simulation, as first proposed in Friedenberg & Genovese (2009). This is desirable because it allows us to still calculate confidence supersets under a wide variety of conditions including various data transformations. In this paper we examine a transformation which uses local information to boost detection power.

4.2.4 Smoothing the p-values and calculating the corresponding Confidence Superset

Instead of examining the p-value directly at voxel Vi we can examine the median of pi together with the p-values of its nearest neighbors. Formally,

$$\tilde{p}_i = \mathrm{Median}(p_i, p_{i,1}, \ldots, p_{i,k}) \qquad (4.8)$$

where pi,j is the p-value for voxel Vi's jth nearest neighbor voxel. This effectively smooths the p-values, suppressing spurious low p-values but preserving regions with swaths of low p-values, like those we would expect to see at an active location in the brain. To calculate a confidence superset, Ũ, for these smoothed p-values we can use the following algorithm.


Algorithm: Simulating the Confidence Superset for Smoothed p-values

1. For each b ∈ {1, . . . , B}, simulate P^b = {p^b_1, . . . , p^b_n}, where each p^b_i ∼ Uniform(0, 1) and n is the number of voxels in the data
2. For each P^b calculate the smoothed set P̃^b = {p̃^b_1, . . . , p̃^b_n} following Equation 4.8, using the p-values from P^b and the nearest-neighbor structure from the original data
3. Record the smallest value in each smoothed set to create P̃_(1) = {p̃^1_(1), . . . , p̃^B_(1) : p̃^b_(1) = min(P̃^b)}
4. Set p̃_α to the α percentile value of P̃_(1), so that α · 100% of the p̃^b_(1) are less than p̃_α
5. Set Ũ = {Vi : p̃_i ≥ p̃_α}, forming a confidence superset by thresholding the smoothed p-values from the original data

Then Ũ can be used as a superset for data where we substitute the smoothed p̃_i for p_i. This smoothed False Cluster Proportion algorithm should have better detection power because the smoothing operation allows us to make decisions using local information instead of only individual voxels.
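A compact sketch of Equation 4.8 and the algorithm above (our own Python); the `neighbors` list, holding each voxel's k nearest neighbors, is assumed to have been precomputed, for example from the voxel coordinates.

```python
import numpy as np

def smooth_pvals(pvals, neighbors):
    """Equation 4.8: replace each p-value by the median of itself and its nearest
    neighbors; neighbors[i] holds the indices of voxel i's nearest neighbors."""
    return np.array([np.median(np.r_[pvals[i], pvals[nbrs]])
                     for i, nbrs in enumerate(neighbors)])

def smoothed_superset(pvals, neighbors, alpha=0.05, B=1000, rng=None):
    """Simulate the confidence superset for smoothed p-values: under the null the
    p-values are Uniform(0,1); smooth each simulated set, record its minimum, and
    threshold the smoothed data p-values at the alpha percentile of those minima."""
    rng = np.random.default_rng(rng)
    n = len(pvals)
    mins = np.array([smooth_pvals(rng.uniform(size=n), neighbors).min() for _ in range(B)])
    p_alpha = np.quantile(mins, alpha)
    return smooth_pvals(pvals, neighbors) >= p_alpha   # Boolean mask: voxels in U-tilde
```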

4.3 Results

We apply the False Cluster Proportion methods both to the original p-values, using U calculated with Equation 4.7, and to the smoothed p-values calculated using Equation 4.8, with Ũ calculated using the algorithm above. Henceforth, we refer to these two methods as the Base and Smooth methods respectively. In both methods we set d, the parameter that dictates the maximum intervoxel distance in a connected component, to be 7; ε was set to 0.9, and for the Smooth method we used k = 3 nearest neighbors to smooth the p-values. The results are not significantly altered if these parameters are modified within reason. Figure 4.3 shows both methods on data where the rings are calibrated such that the bands should be 5mm wide. In both images we can clearly see the banded structure with 10 clear bands of activity. Figure 4.4 shows the results on data where the bands should be 2.5mm wide. As expected, these bands are more difficult to see. Using the base method we


see parts of several bands but none stretch completely across the brain like in the previous example. The smooth method does however detect several bands that stretch across the brain although they are sparse in regions which are not as heavily sampled. Here the benefit of smoothing the data is apparent as the smoothed version presents a much clearer picture of the brain activity. Finally, in Figure 4.5 results are presented for data where the rings are set to be 1.66mm wide. The Base method does not detect any clear bands whereas the smooth method detects one band as well as what appears to be pieces of other bands.

4.4 Discussion

In this chapter False Cluster Proportion algorithms were adapted to handle both points that do not lie on a regular grid and multiple classes of clusters. With these modifications we were able to analyze a series of fMRI experiments, characterizing regions of activity in the brain while at the same time insuring that with high probability the regions were not false positives. Although the methods here were developed for fMRI they can be applied to a much broader class of problems in which we have points in space and want to make inferences about clustered structure in the data. Examples include, but are not limited to, satellite imagery, remote sensing data, and other types of medical imaging data. Future work on this problem involves studying the p-value smoothing and determining if there are better ways to smooth the data and the best way to choose the amount of smoothing. Additionally, we have considered smoothing not only the p-values but also the phases. In the same way that using local information can inform us about the activity of a single voxel it could also help to inform the classification. A related, but more complex issue, is including the uncertainty in classifications into the False Cluster Proportion framework. Ideally we would like to include misclassification errors as part of the error rate that we control, even though on the data in this chapter we do not believe these errors to have much if any effect on the results. Finally, we would like to thank Drs. David Heeger and Eli Merriam at NYU for sharing this exciting dataset with us.


Figure 4.3: 5mm Data: Base method on the top Smooth method on the bottom


Figure 4.4: 2.5mm Data: Base method on the top Smooth method on the bottom


Figure 4.5: 1.6mm Data: Base method on the top Smooth method on the bottom

Chapter 5

Using Diffusion Maps to Detect Cluster Structure in Data

5.1 Introduction

Clustering methods seek to partition observations into k clusters where observations in the same cluster are "similar" and observations in different clusters are "not similar". The exact definition of similarity (or dissimilarity) can differ depending both on the specific problem and on the method being used. In this chapter, we discuss Diffusion Mapping, a data transformation that can simplify complicated cluster structures and potentially avoid many of the problems encountered by standard clustering algorithms.

First, we briefly review some of the common Euclidean-distance based approaches to clustering. Perhaps the most popular clustering algorithm is k-means, which seeks to minimize the within-cluster sum of squares. K-means tends to select clusters that are spherical and of approximately the same size. K-means also requires the user to choose k, the number of clusters, in advance. Another approach is Hierarchical Linkage Clustering, in which groups of observations are linked, in order of "closeness", to construct a hierarchical group structure. This tree structure can be represented by a dendrogram; clusters are extracted by pruning the dendrogram. The type of linkage depends on the measure of "closeness" – two common choices are single linkage and Ward's. Single linkage dendrograms are constructed using an intergroup distance measured by the shortest distance between a pair of observations, one observation in each group. Ward's intergroup distance is based on the distance between cluster centers and is weighted by the size of the clusters. One problem with hierarchical clustering, especially single linkage, is that the results can be heavily influenced by a few well-placed observations. However, hierarchical methods are often better at recovering non-linear and non-spherical structure in the data. Additionally, once we calculate the dendrogram, it is trivial to extract clusters for any k without having to recompute it.


For more details, see Mardia et al. (1979).
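As a small illustration of these standard approaches, the following R sketch (using only base R; the data matrix X, the choice of k, and the example dataset are placeholders for illustration, not choices made in this dissertation) runs k-means, single linkage, and Ward's linkage and then cuts the dendrograms at k clusters.

# X: an n x p numeric data matrix; k: the desired number of clusters
X <- as.matrix(iris[, 1:4])                 # stand-in data for illustration only
k <- 3
km <- kmeans(X, centers = k, nstart = 25)   # k-means minimizes the within-cluster sum of squares
D <- dist(X)                                # Euclidean distances for the linkage methods
hc.single <- hclust(D, method = "single")   # shortest inter-group distance
hc.ward   <- hclust(D, method = "ward.D2")  # Ward's criterion
labels.single <- cutree(hc.single, k = k)   # once built, a dendrogram is cut cheaply for any k
labels.ward   <- cutree(hc.ward,   k = k)
table(km$cluster, labels.ward)              # cross-tabulate the two partitions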

Spectral clustering methods (von Luxburg, 2007) take a different approach. In spectral clustering, a fictive Markov random walk is defined over the observations. The probability of stepping from one observation to another is a function of a problem-dependent similarity matrix. Roughly, we form the transition matrix of the Markov random walk such that the probability of walking from one observation to another is high if the two observations are similar and either very low or zero if they are not. The observations are then projected onto the eigenvectors of the transition matrix and a clustering algorithm, often k-means, is used on the projected data. Spectral clustering has the advantage that it can detect clusters of different shapes and sizes and can also tolerate some outliers.

In this chapter we focus on a specific variant of spectral connectivity analysis called diffusion maps and show that powerful spectral clustering methods can be built around diffusion maps. As we will show, diffusion maps can effectively reduce the dimensionality while representing the connectivity of the dataset – including the cluster structure. Dimension reduction is vital in high-dimensional clustering problems where the structure may be complicated and visualization is difficult. We then explore and compare the performance of different methods for extracting the clusters in the reduced space. Recent work has proposed using k-means in the reduced space; however, we have found that this choice may not be optimal for all situations. We present several examples, including a series of datasets that cause catastrophic failures for standard clustering algorithms. We will also examine the glass identification dataset in which we are trying to characterize shards of glass for the purpose of crime scene investigation. We demonstrate the importance of selecting good tuning parameters and suggest two methods to do so. First, we introduce self-tuning diffusion maps which incorporate local density into the diffusion map, eliminating the need to choose a global tuning parameter. Second, we explore methods for comparing different choices of tuning parameters in an effort to make a reasonable choice across a grid of possible values. We demonstrate these methods on several different datasets and compare the results.


5.2 Diffusion Maps

5.2.1 Overview of Diffusion Maps

In this section we provide a brief overview of diffusion maps, which have been successfully used for tasks such as data parameterization (e.g. Lafon & Lee (2006)), regression (e.g. Richards et al. (2009b)), and high-dimensional density estimation (e.g. Buchman et al. (2009)). Diffusion maps aim to measure the "connectivity" of a dataset and project the data into diffusion space, in which the Euclidean distance between two points is small if the points are highly connected in the original feature space and large otherwise. Assume we have n observations, each with p attributes. We define a weighted graph on the data where each observation is a node and the edge between two observations x and y is defined as w(x, y) = exp(−d(x, y)^2/ε), where d(x, y) is the application-dependent dissimilarity between x and y, often just Euclidean distance. Next, we construct a Markov random walk (Ross, 2006) over the weighted graph where the transition probability of going from x to y in one step is p_1(x, y) = w(x, y) / Σ_z w(x, z). From this, we then construct the transition matrix P where P_ij = p_1(i, j). The transition probabilities will be close to zero unless the two points are similar. The tuning parameter ε controls how quickly the probabilities go to zero and needs to be selected either automatically (e.g. cross-validation) or by the user. A proposed default (Richards, 2009) is to take the median distance to the kth nearest neighbor in the dataset, where k is a small fraction of n, usually around 1%. Since P is the transition matrix for a Markov random walk, it is straightforward to calculate the transition probability after t steps using entries of the matrix P^t; we can also calculate the stationary distribution of the random walk, denoted φ_0(·). We use these quantities to define the diffusion distance between two points at the scale t:

D_t^2(x, y) = Σ_z (p_t(x, z) − p_t(y, z))^2 / φ_0(z)    (5.1)

This distance calls two points close if their t-step conditional distributions are close. The diffusion map at scale t projects the data into an m-dimensional diffusion space such that the Euclidean distance between two points in diffusion space approximates the diffusion distance between the points in the original coordinate system. The mapping is defined as follows:

Ψ_t : x → [λ_1^t ψ_1(x), λ_2^t ψ_2(x), . . . , λ_m^t ψ_m(x)]    (5.2)


with λ_j and ψ_j the eigenvalues and right eigenvectors, respectively, of P. In this paper we use the multi-scale diffusion distance as defined in Richards et al. (2009a). The multi-scale diffusion map simultaneously considers all possible paths between data points across all time scales t, making it more robust to structure at different scales. To do this we replace λ_j^t with Σ_{t=1}^∞ λ_j^t = λ_j/(1 − λ_j), which we call the jth eigenmultiplier. Additionally, using the multi-scale diffusion map we eliminate having to choose t. We still need to choose m, the number of diffusion dimensions. Where appropriate, m is chosen using cross-validation or some other measure of risk. In the absence of such a measure, m can be chosen by examining the drop-off in the eigenmultipliers, similar to choosing the number of principal components based on a scree plot. For more details see Lafon & Lee (2006) and Coifman & Lafon (2006).
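A minimal sketch of this construction in R is given below; it is a hedged illustration rather than the implementation in the DiffusionMap package. It builds the Gaussian weight matrix, the row-normalized transition matrix P via its symmetric conjugate, extracts the leading right eigenvectors, and weights them by the eigenmultipliers λ_j/(1 − λ_j). The function name diffusion.map and the dense eigendecomposition are assumptions; for large n one would use a sparse solver.

# X: n x p data matrix; eps: kernel scale; m: number of diffusion coordinates
diffusion.map <- function(X, eps, m) {
  D <- as.matrix(dist(X))                   # pairwise Euclidean distances d(x, y)
  W <- exp(-D^2 / eps)                      # edge weights w(x, y)
  deg <- rowSums(W)                         # node degrees; P = W / deg row-normalizes
  A <- W / sqrt(outer(deg, deg))            # symmetric conjugate of P, with the same eigenvalues
  ev <- eigen(A, symmetric = TRUE)
  psi <- ev$vectors / sqrt(deg)             # right eigenvectors of P
  # drop the trivial eigenpair (lambda = 1, constant eigenvector), keep the next m
  lambda <- ev$values[2:(m + 1)]
  psi <- psi[, 2:(m + 1), drop = FALSE]
  mult <- lambda / (1 - lambda)             # multi-scale eigenmultipliers
  sweep(psi, 2, mult, "*")                  # n x m matrix of diffusion coordinates
}

A reasonable default for eps, as noted above, is the median distance to the kth nearest neighbor with k around 1% of n.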

5.2.2 Clustering in Diffusion Coordinates

Diffusion maps are often used as a dimensionality reduction technique to project complicated high-dimensional data onto a lower dimensional manifold that preserves most of the structure in the data. In this sense, standard clustering algorithms can be applied to the diffusion coordinates in the same way one might apply them to a principal components representation of the data. Additionally, since Euclidean distances in diffusion space approximate the diffusion distances between observations in the original feature space, projecting the observations into diffusion space and employing a Euclidean-distance based clustering method approximates clustering the observations using diffusion distance in the original feature space. Thus, it is straightforward to apply any number of clustering algorithms to the diffusion coordinates to get a diffusion-distance based clustering algorithm which operates on the reduced-dimensional diffusion space – we call this Diffusion clustering. The benefits of dimensionality reduction are obvious and, as we will show, diffusion distance is a robust, non-linear measure that preserves the intrinsic structure of the data – which makes it an ideal choice for clustering problems.

Diffusion clustering is closely related to the family of spectral clustering algorithms (Nadler et al., 2005). Spectral clustering algorithms form the same transition matrix as in a diffusion map, but instead of weighting each eigenvector according to its eigenvalue, there are different methods of normalizing the eigenvectors, which differ depending on the exact method. A clustering algorithm, typically k-means, is then run on the rows of the matrix formed by the eigenvectors.


For an overview of spectral clustering methods, see von Luxburg (2007). For comparison purposes, we will focus on the method outlined in Ng et al. (2001) – one of the most popular and heavily cited spectral clustering methods. For convenience, we will refer to this method as the Ng method; it is as follows. To find k clusters, take the first k eigenvectors of the matrix P to form the matrix X = [ψ_1, . . . , ψ_k] ∈ R^{n×k}. Each row of X is normalized to have unit length, and clustering is performed treating the normalized rows of X as the observations. Typically k-means is used as the clustering algorithm, but other methods can be used.
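A sketch of this procedure in R under the same weighted-graph construction as above is shown below. The exact normalization in Ng et al. (2001) is based on the symmetric graph Laplacian; using the symmetric conjugate from the earlier sketch is an assumption made here for illustration.

ng.spectral <- function(X, eps, k) {
  D <- as.matrix(dist(X))
  W <- exp(-D^2 / eps)
  deg <- rowSums(W)
  A <- W / sqrt(outer(deg, deg))                                # normalized affinity matrix
  V <- eigen(A, symmetric = TRUE)$vectors[, 1:k, drop = FALSE]  # first k eigenvectors
  V <- V / sqrt(rowSums(V^2))                                   # normalize each row to unit length
  kmeans(V, centers = k, nstart = 25)$cluster                   # cluster the normalized rows
}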

The two biggest differences between the Ng et al. (2001) algorithm and clustering in diffusion space are the weighting of the eigenvectors and the number of eigenvectors selected. In Diffusion clustering the eigenvectors are weighted in such a way that the diffusion distance between points is represented by Euclidean distance in diffusion space. The weighting also ensures that the first eigenvector gets the most weight and that, as the number of eigenvectors used, m, increases, the weight on each additional eigenvector decreases towards zero. The eigenvectors using the Ng method are not guaranteed to drop off in the same way. This can be problematic if we use more eigenvectors than needed, because the extra eigenvectors may not contain any clustering information and could adversely affect the clustering solution. With diffusion maps this is less of a worry because the weighting shrinks the less informative eigenvectors towards zero.

This leads to the important question of how to choose the number of eigenvectors m. The Ng method sets m equal to k, the number of clusters – which itself is usually not known a priori. Under specific conditions it can be shown that this is the correct choice but, as we will show, this will not always lead to the best results. If too few eigenvectors are used, important information can be lost. If too many are used, eigenvectors that are essentially noise will be introduced into the problem. It has been shown (e.g. Raftery & Dean, 2006) that additional noise variables like these can degrade clustering solutions. In contrast to the Ng method, Diffusion clustering does not force m to be equal to k. As we will show with examples, the choice of m is important, and we will propose a method for choosing it automatically.


5.3 Fundamental Clustering Problems Suite

The Fundamental Clustering Problems Suite (FCPS) (Ultsch, 2005) is a series of datasets designed as a benchmark to test clustering algorithms. Each dataset is small enough to be visualized easily, yet most will cause at least one of the k-means, single linkage, or Ward's algorithms to catastrophically fail. For brevity, we present two example datasets in depth and follow with a brief overview of performance on the others. We first run k-means, single linkage, Ward's linkage, and an implementation of the Ng algorithm on the original data. Then we transform the data into diffusion space using the multi-scale diffusion map. Unless otherwise noted, we have used the multi-scale diffusion map with both ε and m selected using the default methods in the DiffusionMap package in R (Richards, 2009). We proceed to run k-means, Ward's, and single linkage on the data in diffusion coordinates. In all cases, the clustering algorithm is given the correct number of clusters. Of course, in practice, the number of clusters is often unknown. Since we know the true cluster structure, we can quantify our performance using the Adjusted Rand Index (ARI) (Hubert & Arabie, 1985). The expectation of this index is zero under random cluster assignments; the ARI is one if the cluster solution perfectly matches the truth.
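For reference, the ARI can be computed directly from the contingency table of two labelings; the small helper below (a sketch, equivalent in spirit to adjustedRandIndex in the mclust package) is reused in later snippets.

adjusted.rand <- function(labels, truth) {
  tab <- table(labels, truth)
  a <- sum(choose(tab, 2))                  # pairs placed together by both labelings
  b <- sum(choose(rowSums(tab), 2))         # pairs together under 'labels'
  c <- sum(choose(colSums(tab), 2))         # pairs together under 'truth'
  expected <- b * c / choose(sum(tab), 2)   # expected agreement under independence
  (a - expected) / (0.5 * (b + c) - expected)
}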

5.3.1 Tetra Data

The Tetra dataset (Figure 5.1) consists of four clusters, each with 100 data points, in three dimensions. The clusters are separated but very close to each other, which can cause problems for methods like single linkage, known for its chaining behavior. In fact, the four-cluster single linkage solution contains one large cluster with 397 points and three clusters with one point each, giving an ARI of zero. In diffusion space, the clusters are pulled apart and the structure is much more obvious. For the Tetra data, all three clustering algorithms run on the diffusion coordinates perfectly recovered the structure in the data, with an ARI of one. Using the multi-scale diffusion map allows us to consider all possible paths between data points over all scales. This consideration prevents us from making mistakes by only looking at one path (likely why single linkage in the original feature space did so poorly).

Figure 5.1: 1.) The Tetra dataset contains four clusters each with 100 observations. 2.) Single linkage on the Tetra dataset groups all but three of the observations (denoted by filled-in circles) into one group. 3.) The Tetra dataset in diffusion coordinates, which represent the connectivity of the data. 4.) Single linkage in diffusion coordinates perfectly recovers the cluster structure in the dataset.

5.3.2 Chain Link Data

The Chain Link dataset (Figure 5.2) consists of two clusters, each containing 500 observations, in three dimensions. The two clusters are interlocking rings and as such are not linearly separable. Neither k-means nor Ward's algorithm can separate the two clusters successfully; both give small ARIs (0.09 and 0.39) and clearly miss the ring structure in the data. Conversely, in diffusion space the rings are well separated using just the first diffusion coordinate – an example where the number of eigenvectors need not be equal to the number of clusters in order to effectively cluster the data. The ring structure is a clear example of one advantage to using a connectivity-based distance like diffusion distance over Euclidean distance in clustering problems. With the structure disentangled in diffusion coordinates, all three clustering algorithms perfectly recover the true structure in the data.

5.3.3 FCPS Results Summary

In Table 5.1, we present the ARIs for nine datasets in the FCPS using several different clustering approaches: k-means, single linkage, and Ward's on the original data coordinates; an implementation of the Ng method (Karatzoglou et al., 2004); and then k-means, single linkage, and Ward's on the diffusion coordinates. For all methods, we assume the true number of clusters is known, although again this often will not be the case in practice. Choosing the number of clusters when the true number is unknown is a difficult problem in its own right, but we will not address it in this section.


Figure 5.2: 1.) The Chain Link dataset consists of two interlocking rings of 500 observations each. 2.) As expected, k-means cannot separate the two clusters. 3.) The Chain Link data in diffusion coordinates. In diffusion space the two rings are pulled apart. The ring structure is present in the diffusion coordinates as well (evident upon zooming in). 4.) K-means in diffusion coordinates perfectly recovers the cluster structure in the data.

However, it should be noted that using diffusion map coordinates can improve visualization (via the reduced space and a better representation of the connectivity of the data) and better separate clusters, so choosing the number of clusters may be easier in diffusion space. Please see www.uni-marburg.de/fb12/datenbionik/data?language_sync=1 for a brief visual summary of all the datasets.

Our results show that clustering in diffusion space usually gives better results than clustering in the original space. The one exception is the EngyTime dataset, which is a mixture of two bivariate Gaussians – a dataset for which k-means in the original feature space should perform well. On this dataset, the first diffusion coordinate separates two outliers from the data, and the clustering algorithms put the outliers in their own cluster of two points. However, the second diffusion coordinate does a good job of separating the two true groups. Later in this chapter we will show that this result is due to a poor choice of default tuning parameters. Outliers also pose a problem in the Target dataset, where there are four clusters of only 3 observations each, but clustering in diffusion space still outperforms k-means or Ward's in the original space. Interestingly, we see that on all examples single linkage in diffusion space returns more accurate clusters than the other methods in diffusion space. We hypothesize that this behavior is due to diffusion maps tending to place points along manifolds in diffusion space, which are better negotiated by the "chaining" of single linkage as opposed to a method like k-means, which prefers spherical clusters.

                           Hepta  Lsun  Tetra  Chain Link  Atom  EngyTime  Target  Two Diamonds  WingNut
K-means                    1.00   0.44  1.00   0.09        0.19  0.82      0.63    1.00          0.86
Single Linkage             1.00   1.00  0.00   1.00        1.00  0.00      1.00    0.00          1.00
Ward's                     1.00   0.37  1.00   0.39        1.00  0.75      0.64    1.00          1.00
Ng (Spectral)              0.81   1.00  0.99   1.00        1.00  NA        0.75    0.99          NA
Diffusion K-means          1.00   1.00  1.00   1.00        1.00  0.00      0.67    1.00          0.90
Diffusion Single Linkage   1.00   1.00  1.00   1.00        1.00  0.00      0.99    1.00          1.00
Diffusion Ward's           1.00   1.00  1.00   1.00        1.00  0.00      0.67    1.00          0.93

Table 5.1: Comparison of ARIs for 9 datasets from the FCPS using several standard clustering algorithms in both the original data coordinates and in diffusion coordinates. An ARI of one means perfect agreement with the true cluster structure, while an ARI of zero is expected for random cluster assignments. The two NAs indicate that the algorithm failed to reach a solution after several hours.

This result is in contrast to Lafon & Lee (2006), which advocates the use of k-means in diffusion space with a theoretical argument. Further work is needed to better understand the behavioral differences among algorithms in diffusion space. However, it is clear that the choice of clustering algorithm in diffusion space has less impact on the final solution when compared to the same clustering algorithms used on the original features. The only dataset where the choice of clustering algorithm in diffusion space seemed to make a noticeable difference was the Target dataset, where outliers caused problems for diffusion k-means and Ward's but not single linkage. In contrast, there were large differences in ARIs based upon the clustering algorithm when using the original feature space for most of the datasets. The FCPS datasets demonstrate that transforming data into diffusion space can be effective for detecting and visualizing cluster structure in situations where standard methods do not perform well.


5.4 Glass Identification Data

The glass identification dataset (Evett & Spiehler, 1987) comprises 214 glass samples with nine attributes each, corresponding to the refractive index and the chemical contents of each sample. The data was obtained from the UCI Machine Learning Repository (Asuncion & Newman, 2007). There are seven different types of glass – two types for building windows, two types for vehicle windows, containers, tableware, and headlamps. Differentiating the types is important because glass left at a crime scene can be used as evidence if it can be properly identified.

On this data, we demonstrate the difficulty in selecting the tuning parameters ε and m in practice. In certain applications, for example regression or classification, these parameters can be selected to minimize a risk function like the cross-validation risk. However, in clustering problems, it is not always clear what risk function is appropriate since different clustering methods optimize different criteria. Instead, we examined a grid of ε values and tracked the performance of the k-means, single linkage, and Ward's algorithms as a function of ε. We selected m using the 95% drop-off in eigenmultipliers. There is no special significance in 95% – we could have just as easily used 90% or 99%. Additionally, one could use a completely different criterion for selecting m than the eigenmultiplier drop-off. One benefit of fixing the 95% drop-off is that it allowed us to isolate how the eigenmultiplier drop-off changed with ε. Figure 5.3 shows the ARI for the k-means, single linkage, and Ward's clustering algorithms compared to the known true types in diffusion space for a range of ε. Of course, in practice we do not know the true types, so we cannot choose our tuning parameters to maximize the ARI. K-means and Ward's exhibit similar behavior as ε is varied. Both show a peak around 0.75, which corresponds to a two-dimensional diffusion map, and a second peak at 16, which corresponds to a three-dimensional diffusion map. In both instances either choice of ε results in a higher ARI than using the same algorithm on the original nine-dimensional feature space. However, we also notice that for ε values between 5 and 15 the algorithms do very poorly in diffusion coordinates. Single linkage exhibits very different behavior. On the original feature space single linkage does poorly, with an ARI of 0.018. In diffusion space there is a sharp peak at 0.75 but no secondary peak. The ARI at the peak is 0.18, which is nine times higher than the ARI using single linkage on the original feature space. This increase is a much bigger discrepancy between the original feature space and the diffusion space than what was observed for k-means and Ward's.

Figure 5.3: The effect of the choice of ε on the quality of the clustering solutions and on the choice of m for the glass data. The panels show the ARI for k-means, Ward's, and single linkage, and the number of dimensions selected when m is chosen using the 95% eigenmultiplier drop-off. The red line is placed at the ARI for the given algorithm on the original data coordinates.

Single linkage also exhibits consistently poor performance for ε between 5 and 15. Interestingly, for ε values greater than 16 we see a rise in the number of dimensions selected. This example demonstrates that clustering in diffusion space can give better results than clustering in the original feature space while reducing dimensionality, but care is needed in selecting the tuning parameter ε.
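The experiment described above can be organized as a simple loop over ε. The sketch below reuses the hypothetical diffusion.map and adjusted.rand helpers from earlier; the helper eigen.multipliers, the grid, and the cumulative-sum reading of the 95% eigenmultiplier drop-off are assumptions made for illustration, and X and glass.type stand for the nine-attribute glass data and the known glass categories.

eps.grid <- seq(0.5, 35, by = 0.5)                   # illustrative grid of kernel scales
ari.ward <- sapply(eps.grid, function(eps) {
  mult <- eigen.multipliers(X, eps)                  # hypothetical helper: all lambda_j / (1 - lambda_j)
  m <- which(cumsum(mult) / sum(mult) >= 0.95)[1]    # one possible reading of the 95% drop-off rule
  Y <- diffusion.map(X, eps, m)
  labels <- cutree(hclust(dist(Y), method = "ward.D2"), k = 7)
  adjusted.rand(labels, glass.type)                  # compare to the 7 known glass types
})
plot(eps.grid, ari.ward, type = "l", xlab = "epsilon", ylab = "ARI")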

5.5 Automatic parameter selection for Diffusion Clustering

The results for the Glass dataset indicate that the choices of both ε and m are critically important in diffusion clustering. We need to simultaneously account for the relationships between the connectivity of the transition matrix and the number of dimensions needed to "see" the cluster structure. The Ng method deals with this issue by setting m equal to k, while ε is chosen by searching for the value which minimizes the distortion of the clusters. In another version of spectral clustering, called self-tuning spectral clustering, ε is replaced by a local scaling. In other settings where diffusion maps are used, one can choose the parameters simultaneously via cross-validation. In this section, we introduce two new approaches to parameter selection: a self-tuning diffusion map procedure that uses local scaling, and a cross-validation-style procedure for evaluating the quality of a clustering solution, called prediction strength, which can be used to select good tuning parameters simultaneously.

5.5.1 Self-Tuning Diffusion Maps

The tuning parameter ε defines at what scale two points should be considered similar. This global parameter can be problematic if the cluster structure is on different scales. A variant of spectral clustering called self-tuning spectral clustering (Zelnik-Manor & Perona, 2005) replaces w(x, y) = exp(−d(x, y)^2/ε) with w(x, y) = exp(−d(x, y)^2/(σ_x σ_y)), where σ_x and σ_y are local scaling parameters. The authors suggest using

σ_x = d(x, x_K)    (5.3)

where x_K is the Kth nearest neighbor of x. Zelnik-Manor & Perona suggest K = 7 as a reasonable default. One benefit of the self-tuning approach is that in dense clusters of points, two points have to be closer to be considered similar than if the points are in a sparse area. This is helpful for detecting structure at different scales. The self-tuning idea can be extended to diffusion maps by incorporating the self-tuning parameters into the dissimilarity matrix. Let

d'(x, y) = d(x, y) / √(σ_x σ_y)    (5.4)

then

w'(x, y) = exp(−d'(x, y)^2/1) = exp(−d(x, y)^2/(σ_x σ_y)).    (5.5)

Using d'(x, y) as our dissimilarity measure and setting ε = 1, we can construct a self-tuning diffusion map. The self-tuning diffusion map eliminates the need to choose ε but still leaves the problem of choosing m. Additionally, we have the choice of σ_x, the local scaling parameter. Zelnik-Manor & Perona suggest σ_x = d(x, x_K) with K = 7. We believe this may not scale well for datasets of different sizes. As an alternative, one could choose K to be a small percentage of n, the approach used to select a default ε in the DiffusionMap package in R (Richards, 2009). This may be more robust since it will scale with the size of


the dataset. We could also use a completely different measure of local density; for instance, a kernel density estimate. We ran self-tuning diffusion maps, using σ_x as defined in Equation (5.3) with K = 7 and m selected via the 95% eigenmultiplier drop-off, on the FCPS datasets. The results were perfect ARIs for all but two of the datasets when using all three methods. The two exceptions were the EngyTime dataset, which we will discuss later, and the Target dataset (ARIs of 0.58, 0.59, and 0.69 for k-means, Ward's, and single linkage). Recall the Target dataset was problematic because it had four very small clusters of three points each that could easily be considered outliers instead of clusters. On the Glass dataset we obtained reasonable ARIs (0.14, 0.14, and 0.16 for k-means, Ward's, and single linkage), but not as good as the best ARIs from Figure 5.3. Self-tuning diffusion maps may be very useful if we anticipate structure at different scales in our data, and they provide a good default if we only have one chance to select tuning parameters. If we have the resources, we may consider computing several diffusion maps and try to optimize our choice of tuning parameters for some criterion. We discuss such an approach in the following section.
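A sketch of the self-tuning weight construction of Equations (5.3)–(5.5) in R is given below; the function name and the default K = 7 follow the discussion above, and the resulting weight matrix can be fed into the same transition-matrix and eigendecomposition steps as before with ε fixed at 1.

self.tuning.weights <- function(X, K = 7) {
  D <- as.matrix(dist(X))
  # sigma_x = distance from each point to its Kth nearest neighbor (Equation 5.3)
  sigma <- apply(D, 1, function(row) sort(row)[K + 1])  # +1 skips the zero self-distance
  exp(-D^2 / outer(sigma, sigma))                       # w'(x, y) of Equation (5.5)
}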

5.5.2 Prediction Strength

Prediction strength (Tibshirani & Walther, 2005) reformulates the clustering problem as a supervised learning problem where we need to estimate the "true" class labels. It was developed as a tool for choosing k, the number of clusters, for problems where it is not known. Here we use it simply to obtain a quantitative score for comparing clustering solutions which differ not necessarily in k but in the tuning parameters m and ε. Prediction strength is not the only measure we could use to compare solutions. Another possibility is the within-cluster sum of squares – the criterion that k-means minimizes. This is the proposed method for choosing ε in the Ng method and could be used for diffusion maps as well. We anticipate that this criterion will be useful for problems where k-means works well but may falter in cases where k-means does poorly. Another possibility is to use model-based clustering (Fraley & Raftery, 1998), which formulates the clustering problem as a mixture model fit using the EM algorithm. Different model-based clustering solutions can then be compared by their Bayesian Information Criterion (BIC) scores. This approach has been used to aid in variable selection (Raftery & Dean, 2006) and could be adapted as a method to choose ε and m. Both the within-cluster sum of squares and the model-based approach are tied to specific clustering algorithms.


As an alternative, prediction strength gives a score similar to a cross-validation score, which is less dependent on a single clustering method. Cross-validation is an approach which has proven very successful for other types of problems where diffusion maps are utilized. Prediction strength splits the data into training and test sets, clusters each separately, and then measures how well the training solution "predicts" the test clusters using co-memberships of pairs of observations. We denote by C(X, k) the clustering operation which splits a dataset X into k clusters, and by D[C(...), X] the co-membership matrix, where C(...) is a clustering solution and X is a dataset that is clustered according to C(...). D[C(...), X]_ij is one if observations i and j from X are put in the same cluster using clustering solution C(...) and zero otherwise. The prediction strength algorithm due to Tibshirani & Walther (2005) is as follows:

Algorithm: Prediction Strength

1. Split the data into X_tr and X_test, training and test sets respectively.
2. Cluster the training and test sets separately using the same k, denoted C(X_tr, k) and C(X_test, k) respectively.
3. Extract the training cluster centers from C(X_tr, k) and assign each test data point to its closest training center.
4. Find D[C(X_tr, k), X_test], the co-membership matrix for the clustering from the previous step.
5. For each cluster in C(X_test, k), calculate the proportion of observation pairs in that cluster which are also put in the same cluster using the clustering from Step 3.
6. The prediction strength is the minimum value from Step 5 over all the test clusters.

Higher prediction strength scores indicate better clustering solutions in the sense that the training solution predicts the test clusters well.
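A sketch of this algorithm in R with k-means as the base clusterer is given below; the clustering step and the center-assignment step can be swapped for other methods, and co-membership proportions are computed pair-by-pair rather than by forming the full matrices.

prediction.strength <- function(Xtr, Xtest, k) {
  ctr  <- kmeans(Xtr,   centers = k, nstart = 25)       # Step 2: cluster the training set
  ctst <- kmeans(Xtest, centers = k, nstart = 25)       # Step 2: cluster the test set
  # Step 3: assign each test point to its closest training center
  d <- as.matrix(dist(rbind(ctr$centers, Xtest)))[-(1:k), 1:k, drop = FALSE]
  pred <- apply(d, 1, which.min)
  # Steps 4-5: for each test cluster, the proportion of its pairs kept together by 'pred'
  scores <- sapply(1:k, function(j) {
    idx <- which(ctst$cluster == j)
    if (length(idx) < 2) return(1)                      # degenerate cluster: no pairs to check
    pairs <- combn(idx, 2)
    mean(pred[pairs[1, ]] == pred[pairs[2, ]])
  })
  min(scores)                                           # Step 6
}

For the 2-fold version used here, one would split the data in half and average prediction.strength(X1, X2, k) with prediction.strength(X2, X1, k).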

Tibshirani & Walther (2005) argue that 2-fold prediction strength is sufficient for splitting the data into training and test sets when used for choosing k. However we are measuring


To reduce noise associated with the random selection of training and test sets, one option is to average several runs of 2-fold prediction strength on the same data when it is computationally feasible. Our approach for the rest of this section is to calculate prediction strength by splitting the data evenly into training and test sets, computing a prediction strength, then flipping the training and test sets and computing a second prediction strength. We then average the two to get a final score, which we call the 2-fold prediction strength. Where indicated in the text, we additionally repeat the 2-fold prediction strength calculation across different training and test sets and average the result over the multiple trials. This helps to reduce variability in the prediction strength due to the random selection of training and test sets.

The prediction step in the prediction strength algorithm assigns test points to the nearest training cluster center. Estimation of the training cluster centers can be susceptible to outlier points which move the estimated center away from the majority of the cluster. This effect can be exaggerated in diffusion coordinates, where points that are far away from other data points have very large diffusion distances. To combat this effect, we have several options, including using a median to estimate the cluster center instead of a mean, as in k-medoids (Kaufman & Rousseeuw, 1990), or the trimmed k-means algorithm (Cuesta-Albertos et al., 1997), which estimates the cluster centers using a trimmed mean that excludes outliers. We can use these robust estimates of the cluster centers directly in Step 3 of the prediction strength algorithm; a simple variant is sketched below. K-medoids or trimmed k-means can also be applied directly to the diffusion coordinates as a viable clustering algorithm in diffusion space. These algorithms are expected to produce better results in the presence of a few outlier points at a slightly higher computational cost.

The prediction strength measure can be misleading when the clustering solution finds one giant cluster that contains most of the data points. In this situation, since most of the data points belong to the same cluster, the co-membership matrix will be mostly ones and prediction strength will be high (it is easy to predict that almost all of the points belong to the same cluster). However, this result is usually not the desired clustering solution. We can identify this situation by looking at the number of points in each cluster and flagging the corresponding prediction strengths.
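One simple robust variant of the prediction step, sketched below, replaces each training cluster's mean with a coordinate-wise trimmed mean before assigning test points; the trim fraction is illustrative, and a full trimmed k-means or k-medoids fit (e.g. the cluster package's pam) is the more principled alternative.

robust.centers <- function(Xtr, labels, trim = 0.1) {
  t(sapply(sort(unique(labels)), function(j)
    apply(Xtr[labels == j, , drop = FALSE], 2, mean, trim = trim)))  # coordinate-wise trimmed means
}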


5.5.3 Diffusion Clustering using Prediction Strength to choose tuning parameters

Prediction strength was originally conceived to choose k, but we can use it to compare any two clustering solutions. In our case, we calculate the diffusion map for a given choice of tuning parameters and then input the diffusion coordinates as the training and test data in the prediction strength algorithm. We can also use it to choose k if necessary. For each parameter choice, we get a prediction strength score, which is like a cross-validation score for clustering. The score is then compared to the scores for other parameter choices in order to make a good choice of tuning parameters. We then have the following Diffusion clustering algorithm with automatic selection of tuning parameters.

Algorithm: Diffusion Clustering using Prediction Strength

1. Choose a grid of reasonable parameter choices for m, ε, and k (if unknown).
2. For each parameter choice in the grid, calculate the corresponding diffusion map.
3. Calculate the 2-fold prediction strength for the data in diffusion coordinates.
4. Choose the parameter combination that gives the highest 2-fold prediction strength.
5. Apply a clustering algorithm to the diffusion map with the parameters that gave the highest 2-fold prediction strength.

We also note that this algorithm is easy to parallelize since for each parameter choice the 2-fold prediction strength can be calculated independently.
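Putting the pieces together, a sketch of the full search is shown below, reusing the hypothetical diffusion.map and prediction.strength helpers; the grids are illustrative, X and k stand for the data and number of clusters, and because each combination is independent the apply loop could be replaced by, e.g., parallel::mclapply.

eps.grid <- c(0.05, 0.1, 0.2, 0.4, 0.8)                 # illustrative grids for Step 1
m.grid <- 1:10
grid <- expand.grid(eps = eps.grid, m = m.grid)

ps <- apply(grid, 1, function(par) {                    # Steps 2-3
  Y <- diffusion.map(X, eps = par["eps"], m = par["m"])
  half <- sample(nrow(Y), floor(nrow(Y) / 2))           # random even split for the 2-fold score
  mean(c(prediction.strength(Y[half, , drop = FALSE],  Y[-half, , drop = FALSE], k),
         prediction.strength(Y[-half, , drop = FALSE], Y[half, , drop = FALSE],  k)))
})

best <- grid[which.max(ps), ]                           # Step 4
final <- kmeans(diffusion.map(X, best$eps, best$m), centers = k, nstart = 25)  # Step 5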

5.5.4 Visualizing Prediction Strengths Over a Grid of Parameter Choices

The diffusion clustering using prediction strength algorithm yields a prediction strength score for each choice of tuning parameters. We visualize this result in an image where the x-axis corresponds to ε, the y-axis corresponds to m, and the intensity represents the prediction strength at that point on the grid. If we look at one row in the image, we see the fluctuations in prediction strength due to variations in ε using a fixed number of eigenvectors.


If we instead look at a column, we see how the prediction strength changes with the number of eigenvectors. An example corresponding to the Chain Link dataset is shown in Figure 5.4.
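The image itself can be drawn with base graphics, assuming the ps vector and the grids from the sketch above:

ps.mat <- matrix(ps, nrow = length(eps.grid))           # rows follow eps.grid, columns follow m.grid
image(eps.grid, m.grid, ps.mat, col = gray.colors(100),
      xlab = "Eps", ylab = "m", main = "Prediction Strength")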


Figure 5.4: Visualizing prediction strength over a grid of tuning parameters.

When interpreting these images to find a reasonable set of tuning parameters, we follow a few simple principles. First, we prefer parsimonious solutions; this means that for a fixed ε, if we see high prediction strengths for several values of m, we prefer the smallest. It is common to see vertical stripes where the prediction strength is stable for increasing m because of the downweighting of additional eigenvectors when calculating the diffusion map. For example, in Figure 5.4 we see that there are several ε choices which give very high prediction strengths for any choice of m. In this instance, we would select m = 1 to obtain a low-dimensional representation of the data which still preserves the cluster information. Second, if there are swaths of ε values with similar prediction strengths, choose one in the middle to obtain a solution which should be stable, in the sense that if we were to increase or decrease ε by a small amount we would still be in a region with high prediction strength.


Looking at Figure 5.4, we see that ε values from close to zero up to about 0.2 have high prediction strengths. In this case, we would select ε to be equal to 0.1. In summary, we look at these plots to find regions of the parameter space that have high prediction strengths and, within those regions, select parameter choices that should yield stable and parsimonious diffusion maps for clustering. For the example plot in Figure 5.4, we would choose a diffusion map with ε = 0.1 and one eigenvector to cluster.

5.6 Results of Diffusion Clustering using Prediction Strength

5.6.1 Tetra Revisited

On the Tetra data, there are four clusters, and we see that for any m greater than two and any choice of ε on the chosen grid we get both an ARI and a prediction strength of one (Figure 5.5). As we would expect, regions of parameter space with high prediction strengths also have high ARIs, indicating that choosing the parameters via prediction strength gives the best clustering solutions for this dataset. The plots show that the Tetra dataset is not overly sensitive to the choice of parameters as long as we use at least three eigenvectors. We also note that we are able to perfectly recover the cluster structure using only a three-dimensional diffusion map, whereas the Ng method uses a four-dimensional representation and does not get a perfect recovery, although it is very close (ARI of 0.993). Additionally, even if we vastly overestimate the number of eigenvectors, we do not adversely affect the results. This is a feature of the eigenmultipliers, which as seen in Figure 5.6 quickly drop towards zero after the first three. The additional dimensions do not carry much influence and are unlikely to cause issues as "noise" variables as they might using other methods. It is also worth noting that a large region of our parameter space has a perfect prediction strength of one, which means there are several sets of "optimal" parameters. We believe this is largely due to the easily separable synthetic dataset; the choice of ε is not particularly important for this problem.

5.6.2 Chain Link Revisited

The Chain Link data are much more sensitive to the choice of tuning parameters. As seen in Figure 5.7, values of ε less than 0.2 perfectly capture the structure of the data, with performance dropping off for ε > 0.2.

Figure 5.5: Tetra: Top row and bottom left: ARIs for k-means, single linkage, and Ward's linkage over a grid of ε and m. Bottom right: Prediction strengths over the same grid.

With a good choice of ε, we can perfectly recover the cluster structure using only one eigenvector, as opposed to the two eigenvectors chosen in the Ng method. Although we only need one eigenvector, again we see that inflating the number of eigenvectors does not degrade performance because the eigenmultipliers quickly decay to zero. As was the case with the Tetra data, the prediction strength plot closely mirrors the observed ARIs and would suggest choosing ε around 0.1 and m = 1, since all ε between zero and 0.2 have good prediction strengths and we prefer using as few dimensions as possible. This choice of parameters produces a diffusion map on which all three clustering methods perfectly recover the cluster structure.


Figure 5.6: Drop-off in Eigenmultiplier for Tetra Data

5.6.3 EngyTime Data

We saw that using the default choices of ε and m in the DiffusionMap package in R gave a poor diffusion clustering for the EngyTime dataset, shown in Figure 5.8. A few outlier points dominated the first diffusion coordinate, leading to a poor solution. In this case, we applied the trimmed k-means version of prediction strength to minimize the effect of the outliers. In Figure 5.9 we show the ARI results of trimmed k-means on the diffusion coordinates as well as the trimmed k-means prediction strength. The prediction strength plot closely mirrors the plot of ARIs using trimmed k-means, so that if we choose ε and m based on prediction strength the resulting diffusion clustering will have a high ARI. Prediction strength also identifies the previously problematic region of parameter space with a lower prediction strength (the default choice which did not perform well was ε = 0.203 and m = 2). Thus the one poor diffusion clustering we had in the FCPS examples was not a failure of the diffusion clustering framework but rather a result of poor tuning parameter choices. Using trimmed k-means in diffusion coordinates with good parameter choices, we get an ARI of 0.821, which matches k-means (which should perform well for a mixture of two Gaussians) and bests both Ward's and single linkage on the original data coordinates. Using k-means on the self-tuning diffusion map produced an ARI of 0.798, which is also very good.

Figure 5.7: Chain Link: Top row and bottom left: ARIs for k-means, single linkage, and Ward's linkage over a grid of ε and m. Bottom right: Prediction strengths over the same grid.

5.6.4 Hepta Data

The Hepta dataset (Figure 5.10) contains seven well-separated clusters in three dimensions. It is not a particularly difficult dataset to cluster in the original coordinates – k-means, Ward's, and single linkage all perfectly recover the clusters. However, using the Ng method as implemented in the kernlab package (Karatzoglou et al., 2004) we obtained an ARI of 0.81. Upon further examination, the ARI ranged from a low of about 0.6 to a perfect 1 when the procedure was repeated on the same data 100 times (Figure 5.11). Additionally, there were nine runs where the algorithm did not produce a solution. This instability may be caused by the Ng method projecting a three-dimensional dataset via seven eigenvectors. Applying k-means in seven dimensions requires the estimation of 49 parameters – 7 centers, each in 7 dimensions. The instabilities in the Ng method may be fixable by increasing the default number of iterations in the k-means algorithm, which would make the algorithm slower. Although we might reach a solution, we pay a computational price. In contrast, using diffusion maps and selecting the parameters via prediction strength, we get a two-dimensional projection that, as seen in Figure 5.12, clearly separates the clusters. Using k-means on the diffusion map only requires the estimation of 14 parameters, and it recovered the clusters perfectly in each of 100 trials. For comparison, we also looked at running k-means using a 7-dimensional diffusion map and found that it too perfectly recovered the clusters every time over 100 trials (likely because the additional dimensions were down-weighted in the diffusion map).


Figure 5.8: EngyTime Dataset

Figure 5.9: EngyTime: Left: ARIs for trimmed k-means over a grid of ε and m. Right: Prediction strengths using trimmed k-means over the same grid.


Figure 5.10: The Hepta dataset has 7 clusters in 3 dimensions.

5.6.5 Glass Revisited

On the Glass data, prediction strength (Figure 5.13) does not seem to give a conclusive answer for which parameters to select. Using trimmed k-means, the highest ARI we obtain is 0.290, although the parameters associated with the highest trimmed prediction strength produce an ARI of 0.186.


Figure 5.11: Histogram of ARIs for 100 repeated runs of the Ng method on the Hepta data via its implementation in the kernlab package in R.

This underscores the difficulty in selecting good parameters in practice. Although prediction strength will usually indicate stable regions of parameter space to try, it does not guarantee the best parameter choice and should not be used blindly. The self-tuning diffusion map has ARIs of 0.139, 0.139, and 0.161 for k-means, Ward's, and single linkage respectively. These are not terrible, but not great either. We hypothesize that altering the number of nearest neighbors from the suggested default of seven may improve the results; perhaps using prediction strength to select a good number will lead to a better self-tuned diffusion map for clustering. Future work will explore this option.

Figure 5.12: Diffusion map representation of the Hepta dataset. Tuning parameters were chosen using prediction strength.

5.7 Conclusions

We have shown that projecting data into diffusion space can identify complicated cluster structure, can be used to reduce dimensionality, can help visualize complex datasets, and can reduce the dependence of performance on the choice of clustering algorithm. These features are a potent combination, as evidenced by the success of diffusion-based clustering algorithms on the ten datasets analyzed in this chapter. Compared to other spectral clustering methods, diffusion maps allow us to visualize and categorize the connectivity of a dataset with the physically motivated diffusion distance, which is difficult to compute in the original feature space. Additionally, we are often able to cluster the data using fewer dimensions than standard spectral clustering methods; reducing the dimensionality may improve stability and reduce the possibility of noise variables degrading performance.


Figure 5.13: Left: Trimmed prediction strength for the Glass dataset. Right: ARI for the Glass dataset using trimmed k-means in diffusion coordinates.

We also examined the choice of clustering algorithm to use post-projection. Lafon & Lee (2006) motivates the use of k-means in diffusion coordinates with a theoretical argument. However, in practice, k-means may not perform as well as other methods. For instance, on several datasets single linkage performed better in diffusion space. One reason single linkage may be so effective is its connection to nonparametric clustering (Stuetzle, 2003), an approach that does not restrict the size or shape of the clusters. As such, it may be better than k-means at detecting clusters that are not spherical or that lie on manifolds, as is often the case in diffusion space. A promising line of research is the use of other more robust nonparametric clustering methods – especially ones that traditionally may not work well (or at all) in high dimensions but become more tractable once the observations have been projected into the lower dimensional diffusion space. In general, our results demonstrate that clustering in diffusion space is not as sensitive to the choice of clustering algorithm as clustering in the original space.

With the glass identification dataset, we have shown the importance of selecting good tuning parameters. We have proposed a procedure for choosing ε and m (also k if needed) and shown that on several examples it identifies reasonable parameter choices. Additionally, there may be value in combining the clustering information from a range of parameter choices to obtain an ensemble diffusion map that incorporates the information from different parameter choices and possibly different dissimilarity matrices.


We also introduce the self-tuning diffusion map, which incorporates the local density into the similarity matrix. These local parameters may help identify structure at different scales, and the method shows promise not just in clustering problems but in other situations where diffusion maps are utilized, such as regression or classification. As data acquisition and storage become cheaper, methods for finding structure in high-dimensional data have become increasingly important. Diffusion maps offer an attractive solution to many of these high-dimensional problems. Whereas a common approach to clustering is to anticipate the shape and structure of clusters beforehand and choose a clustering method that fits that expectation, we propose not making assumptions about the clusters initially and instead projecting the data into a space where the choice of clustering method is not as important; we allow the projection to separate the clusters while simultaneously reducing the dimensionality of the problem.

Chapter 6

Discussion

Current technologies are making it easier to collect and store vast quantities of high-quality data. It is often the case that we are interested in detecting clustered structure in these data. The clustered structure could be clusters of pixels or voxels that stand out from the background due to some physical source, like an x-ray-emitting object or underlying brain activity. Or it may be the case that we have data and want to partition it in some sensible way such that "similar" observations are grouped together, as in the classical clustering setting. This dissertation introduces several new methods whose common theme is reliably detecting the clustered structure in data under a wide range of conditions. In the following sections we review these methods, highlighting their strengths and weaknesses as well as discussing avenues for future work.

6.1 False Cluster Proportion Methods for Astronomical Source Detection

We introduce the astronomical source detection problem, in which we are trying to detect astronomical objects, sources, against a noisy background in a telescope image. We treat this as a multiple testing problem where our goal is to bound the rate of falsely detected sources, as opposed to pixels. The False Cluster Proportion procedure, which allows us to make our inferences on the objects of interest, the sources themselves, is adapted for this purpose. Additionally, we define a new strength of detection measure for each detected object so that we can quantify our confidence in each detection. The Chandra Deep Field South (CDFS) dataset is presented as a test case for the method but does not initially satisfy the assumptions of the FCP method. We introduce the smooth-then-root method to normalize the image so that the assumptions can be met. Applying FCP to the CDFS data, we get the desired control over the rate of false positive detections, but do not detect several sources that should be detected.


The chapter concludes with a series of algorithms that generalize the FCP method to be directly applicable to a wider variety of noise conditions, setting the stage for source-enhancing filters to be included in FCP methods and reducing the need for normalizing transforms like smooth-then-root.

6.1.1 Future Work

The strength of detection measure tells us the largest false cluster proportion tolerance at which a given cluster will still be included in the source catalog. Another possible method for measuring the strength of a detection is to assign a confidence to the detection. In certain Bayesian source detection methods, each detected cluster is assigned a probability of being real. While FCP is not a Bayesian method, we could potentially find a measure such that we could make a statement of coverage for each individual detection as well as the coverage FCP provides for the catalog as a whole.

Another area that could be improved is the simulation of confidence supersets. While the methods outlined in this document provide a valid confidence superset, there may be room for improvement both in the size of the confidence superset, which directly affects the power of FCP, and in the computational complexity of the algorithm. The computational complexity of the entire procedure is also a concern, especially for extremely large datasets like those LSST is expected to produce. An intriguing option is using a smaller piece of data to estimate t_c, the FCP cutoff, for the entire image. In the same vein, if FCP is going to be applied to a series of datasets from the same instrument taken under the same conditions, it may be more practical to estimate t_c on a subset of the images and simply apply the cutoff to the other images. The effects of estimating the cutoff and the circumstances where it may fail need to be studied more carefully, including how to choose the subsets and what size they should be. The trade-off between computational speed and algorithm accuracy is an interesting and important problem that would need to be carefully considered if FCP were to be employed on images from LSST or a similar telescope.

6.2

Enhancements to FCP Methods for Astronomical Source Detection

Chapter 3 introduces several extensions to the base FCP method from Chapter 2. The first extension is the Multi-scale derivative transform, which enhances sources in telescope images. The transform is then integrated into an FCP method that we call MSFCP. MSFCP merges the power of Multi-scale derivatives with the rigorous error control of FCP. We use simulations to show the increased power of MSFCP over FCP and to verify that the error rates are as expected. This is also evident when we revisit the CDFS data, where we show that MSFCP is competitive with other standard methods that do not have such rigorous control on the rate of falsely detected sources. We also detect two new sources that were missed in the original analysis of the data. Another extension is modifying our procedures to account for images with variations in exposure time. For images where the exposure time varies, we revise our methods so that they respect fluctuations in exposure time. This includes a new version of the Multi-scale derivative, called the ETMSD, which finds peaks in the data but ignores peaks created artificially by sharp changes in exposure time. Furthermore, we demonstrate an alternative way to calculate confidence supersets by taking the union of confidence supersets for subsets of the data selected by exposure time. This method provides a smaller confidence superset for data with varying exposure times, which leads to more power when FCP is applied.
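
The exact form of the Multi-scale derivative is given in Chapter 3; as a rough illustration of the kind of source-enhancing filter involved, the sketch below uses a maximum-over-scales difference-of-Gaussians response. This is a stand-in for illustration only, not the MSD itself, and the scale grid is an arbitrary choice.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def multiscale_peak_response(image, scales=(1, 2, 4, 8)):
        # Stand-in for a multi-scale source-enhancing filter: at each scale take
        # a difference-of-Gaussians response and keep, for every pixel, the
        # maximum response over all scales.
        img = image.astype(float)
        responses = [gaussian_filter(img, s) - gaussian_filter(img, 2 * s)
                     for s in scales]
        return np.max(responses, axis=0)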

6.2.1

Future Work

The Multi-scale derivative is a new method for detecting peaks in the data, but it is by no means the only way to detect them. Several source detection methods use wavelets or other techniques for detecting deviations from a smooth background. Wavelets could easily be merged with FCP in the same way that we merged Multi-scale derivatives. We anticipate that a wavelet version of FCP would perform similarly to MSFCP, and more work would be needed to analyze the pros and cons of the two approaches. Astronomers have studied wavelets as a source detection method, and some of their ideas for dealing with exposure times or other artifacts could likely be incorporated into a wavelet version of FCP.

Our approach to creating a confidence superset from data binned by exposure time shows promise for improving power. We bin the data by exposure time and then, for each bin, calculate a 1 − αi confidence superset; the union of these supersets forms a 1 − Σi αi confidence superset for the entire image. An open question is the appropriate way to choose the bins. On the one hand, many small bins ensure that we have good power across the range of exposure times. On the other hand, as the number of bins increases the αi must still sum to α, so the αi necessarily get smaller and the superset for each bin becomes more conservative. To obtain the best possible confidence superset from this procedure we need to choose the number of bins to balance these competing issues. In addition to the number of bins, one must choose the exposure times that each bin covers, which will likely depend on the specifics of the dataset.
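
The arithmetic of this error budget is simple; the snippet below only illustrates it under an assumed equal-allocation rule αi = α/B, since the allocation across bins is left open above.

    # Error-budget arithmetic for exposure-time bins (illustrative only).
    # With overall level alpha and B bins, one simple choice is alpha_i = alpha / B;
    # the union of the per-bin 1 - alpha_i supersets is then a 1 - alpha superset.
    alpha = 0.05
    for n_bins in (2, 5, 10, 20):
        alpha_i = alpha / n_bins
        print(f"{n_bins:2d} bins -> per-bin level {alpha_i:.4f}")
    # More bins track the exposure map more finely, but each per-bin superset is
    # built at a stricter level and is therefore more conservative.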

6.3

Detecting Cluster Structure in Functional Neuroimaging Data

Chapter 4 takes the methods from the previous two chapters and adapts them to handle data from fMRI scanners. The two major extensions of the FCP methods are handling points in space and allowing for different classes of clusters. Previously, our methods assumed the data were images on a regularly spaced grid, which made it easy to define a cluster in terms of adjacent pixels. The fMRI data are not on a regularly spaced grid, so we adapt FCP by introducing a distance parameter that defines how close two points need to be in order to be considered adjacent. Additionally, there are two distinct classes of clusters in the fMRI data, corresponding to two different stimuli. First, we develop a classification procedure for differentiating between the classes. We then extend the FCP method to handle multiple classes of clusters by including the class information in the definition of a cluster. Two methods for constructing confidence supersets for this type of data are then introduced. The first is based on the null distribution of the p-values, while the second is based on smoothing the data and simulating a superset, analogous to the MSFCP procedure. We call these the base and smooth methods, respectively. Applying both methods to three fMRI experiments, we demonstrate that they are effective at capturing the cluster structure in high-resolution fMRI experiments. As expected, the smooth method has better power than the base method.
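
As an illustration of the distance-based notion of adjacency, the sketch below groups irregularly spaced points into clusters of mutually reachable neighbors; the KD-tree and connected-components machinery, and the radius value, are illustrative choices rather than the implementation used in Chapter 4.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def adjacency_clusters(points, d):
        # Two points are "adjacent" if they lie within distance d of each other;
        # clusters are the connected components of the resulting adjacency graph.
        tree = cKDTree(points)
        pairs = tree.query_pairs(r=d, output_type="ndarray")
        n = len(points)
        graph = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                           shape=(n, n))
        _, labels = connected_components(graph, directed=False)
        return labels

    # Example: 200 random 3-d voxel centers, adjacency radius 1.5.
    rng = np.random.default_rng(1)
    labels = adjacency_clusters(rng.uniform(0, 10, size=(200, 3)), d=1.5)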

6.3.1

Future Work

Before we apply the FCP method, we must first classify the voxels by phase class, which is measured on the unit circle. A classifier based on the assumption that the phase classes should be 180◦ apart is trained on a small sample of the data and then applied to the rest. The classification method we use appears to be sufficient for the problem at hand, but it is certainly not the only way to classify the points. For instance, support vector machines may perform better. We might also consider weighting the phases by their p-values in the classifier, since small p-values indicate stronger phase components. Alternatively, we could use the p-values to determine the initial set of points used to train the classifier.

The smooth method smooths the p-values by taking the median of the p-values over each voxel and its nearest neighbors before running FCP. Further work is needed to determine the number of nearest neighbors that best balances the trade-off between bias and variance. Additionally, other smoothing methods, perhaps borrowing from the literature on kriging, may perform better under certain circumstances. This problem is analogous to the earlier Multi-scale derivative and wavelet discussion in that we are trying to find a reasonable transformation of the data such that the clusters are enhanced and the background is suppressed. We could also consider smoothing the phases as well as the p-values.

The FCP method developed for fMRI data assumes the phase class labels are given; it does not take into account errors due to misclassification. We envision certain instances where the labels may be known a priori, but others where they have to be inferred from the data. In the latter case, it would be worthwhile to incorporate the uncertainty in the class labels into the overall error bound given by FCP. This would require reexamining the entire procedure and how errors are propagated through the steps of the algorithm.

The methods developed for fMRI data have been tested on real data where we do not know the true activation patterns; it would be informative to run the same procedures on simulated data. Although we can show mathematically that the probabilistic bound on the rate of falsely detected regions holds in this case, simulations could help us further understand the properties of the methods. They may highlight whether there is room for improvement and allow us to quantify performance under a range of conditions.
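
To make the smoothing step concrete, here is a minimal sketch of median-of-neighbors smoothing of the p-values, assuming Euclidean nearest neighbors; the number of neighbors k is exactly the open tuning choice discussed above.

    import numpy as np
    from scipy.spatial import cKDTree

    def median_smooth_pvalues(points, pvals, k=6):
        # Replace each voxel's p-value with the median over the voxel and its
        # k nearest neighbors (a sketch of the smooth method's preprocessing
        # step; k is a tuning choice, not a fixed value).
        tree = cKDTree(points)
        _, idx = tree.query(points, k=k + 1)   # includes the point itself
        return np.median(np.asarray(pvals)[idx], axis=1)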

6.4

Detecting Cluster Structure using Diffusion Maps

Diffusion maps have been successfully applied to several types of statistical problems, including dimensionality reduction, regression, and classification. We have shown that diffusion maps can be very effective for detecting cluster structure, even in cases where standard methods fail. Projecting data into diffusion coordinates often makes the structure clearer, which in turn makes the choice of clustering algorithm less important than it would be in the original feature space. This is demonstrated on several datasets where standard methods disagree on clustering solutions in the original feature space but agree in diffusion space. Diffusion maps are closely related to the family of spectral clustering methods, and we contrast the two approaches using a popular spectral clustering method (Ng et al., 2001). Both methods project data onto the eigenvectors of a transition matrix that represents the connectivity of the observations. The spectral method selects the number of eigenvectors to match the number of clusters, whereas we propose data-driven methods for choosing the number of eigenvectors. Using our methods, we present several examples where we are able to recover the cluster structure in the data with fewer eigenvectors than the Ng method. We also show that diffusion maps can be more robust to overestimation of the number of eigenvectors, including one example where overestimating the number of eigenvectors needed may have caused serious instabilities in the Ng method.

We also examine the choice of the tuning parameter ε, which determines how far apart two observations must be before their similarity is shrunk to zero. The glass data demonstrate the importance of this parameter, as the clustering solutions varied considerably for different values of ε. We propose two novel methods for choosing both ε and m, the number of eigenvectors to use. The first is the self-tuning diffusion map. Self-tuning diffusion maps replace the global parameter ε with a series of local parameters, creating a diffusion map that should be more sensitive to structure occurring on different scales. In practice, we show that self-tuning diffusion maps perform very well on several datasets, especially if we do not have the resources to search over a range of possible parameter choices. If we do have the resources to search over different combinations of parameters, we propose using prediction strength to find reasonable parameter choices. On several examples, this method identifies regions of parameter space that produce diffusion maps which highlight the cluster structure in the data, making the structure easier to detect than it would be in the original feature space.
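
For readers unfamiliar with the construction, the following is a compact, simplified sketch of a diffusion map: Gaussian affinities with a global bandwidth ε, row-normalization into a Markov transition matrix, and projection onto the leading non-trivial eigenvectors. Refinements used in the full framework, such as the α-normalization and diffusion-time weighting, are omitted, and the bandwidth value in the example is purely illustrative.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.linalg import eig

    def diffusion_map(X, eps, m=2):
        # Simplified diffusion map: Gaussian affinities with a single global
        # bandwidth eps, row-normalized into a Markov transition matrix, then
        # projected onto the m leading non-trivial eigenvectors (weighted by
        # their eigenvalues).
        D2 = squareform(pdist(X, "sqeuclidean"))
        K = np.exp(-D2 / eps)                       # affinity matrix
        P = K / K.sum(axis=1, keepdims=True)        # Markov transition matrix
        vals, vecs = eig(P)
        order = np.argsort(-vals.real)
        vals, vecs = vals.real[order], vecs.real[:, order]
        return vecs[:, 1:m + 1] * vals[1:m + 1]     # drop the trivial eigenvector

    # Example: embed 100 points from two noisy rings into 2 diffusion coordinates.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 100)
    radius = np.where(rng.random(100) < 0.5, 1.0, 3.0)
    X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    X += 0.05 * rng.standard_normal(X.shape)
    coords = diffusion_map(X, eps=0.5, m=2)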

6.4.1

Future Work

Self-tuning diffusion maps show great promise not just for clustering problems but as a way to reduce the dimensionality of a dataset while preserving its intrinsic structure, even when that structure occurs on varying scales. Once the data are projected into diffusion coordinates, we anticipate they could be used in regression, classification, or other types of analyses. More work is needed to better understand how to choose the local parameters in a self-tuning diffusion map; cross-validation should be the answer in several situations but may not always be applicable.

In the case where we have the resources to search over a grid of parameters, we advocate using prediction strength to choose the tuning parameters. However, there are several other possible criteria for ranking clustering solutions, including cluster distortion and, for model-based clustering algorithms, BIC. Future work should give a comprehensive comparison of the different methods, with the goal of producing algorithms in which a good set of tuning parameters is chosen without input from the user.

Ensemble methods have proven very effective in classification problems, and counterpart methods exist for clustering. Instead of trying to choose one good set of tuning parameters, another option may be to use several combinations of tuning parameters to construct several diffusion maps and use them collectively to cluster the data. We call this approach the ensemble diffusion map. The ensemble diffusion map is another method that may help account for structure on different scales in the data.

Further work is also needed to explore which clustering algorithms perform best in diffusion space. On several examples, single linkage outperformed k-means and Ward's linkage; we believe this may be due in part to single linkage's connections to non-parametric clustering. Non-parametric methods may prove to be very effective in diffusion space, where the dimensionality of the data has already been reduced.
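
For concreteness, here is a sketch of the local-scaling idea behind self-tuning diffusion maps, using the k-th-nearest-neighbor rule of Zelnik-Manor and Perona (2005) to set each point's bandwidth; the exact rule used for the local parameters in this dissertation may differ, and k = 7 is an arbitrary illustrative choice.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def local_scaling_transition_matrix(X, k=7):
        # Local-scaling affinities: each point i gets its own bandwidth sigma_i,
        # the distance to its k-th nearest neighbor, and
        #   W_ij = exp(-d_ij^2 / (sigma_i * sigma_j)).
        # Row-normalizing W gives a Markov matrix that can replace the
        # global-epsilon kernel in the diffusion map construction above.
        D = squareform(pdist(X))
        sigma = np.sort(D, axis=1)[:, k]        # k-th nearest-neighbor distance
        W = np.exp(-(D ** 2) / np.outer(sigma, sigma))
        return W / W.sum(axis=1, keepdims=True)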

Bibliography

Anscombe, F. (1948). The transformation of poisson, binomial and negative-binomial data. Biometrika, 35, 246–254.
Asuncion, A., & Newman, D. (2007). UCI machine learning repository. URL http://www.ics.uci.edu/∼mlearn/MLRepository.html
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, B (57), 289–300.
Benjamini, Y., Krieger, A. M., & Yekutieli, D. (2006). Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3), 491–507.
Bertin, E., & Arnouts, S. (1996). SExtractor: Software for source extraction. A&AS, 117, 393–404.
Brockwell, P. J., & Davis, R. A. (1998). Time Series: Theory and Methods (Springer Series in Statistics). Springer.
Buchman, S. M., Lee, A. B., & Schafer, C. M. (2009). High-Dimensional Density Estimation via SCA: An Example in the Modelling of Hurricane Tracks. ArXiv e-prints.
Buckner, R. L. (1998). Event-related fMRI and the hemodynamic response. Human Brain Mapping, 6, 373–377.
Carvalho, P., Rocha, G., & Hobson, M. P. (2009). A fast Bayesian approach to discrete object detection in astronomical data sets - PowellSnakes I. MNRAS, 393, 681–702.
Cash, W. (1979). Parameter estimation in astronomy through application of the likelihood ratio. ApJ, 228, 939–947.

Coifman, R. R., & Lafon, S. (2006). Geometric harmonics: A novel tool for multiscale out-of-sample extension of empirical functions. App. Comput. Harmon. Anal., 21(5), 31–52.
Cuesta-Albertos, J. A., Gordaliza, A., & Matran, C. (1997). Trimmed k-means: An attempt to robustify quantizers. The Annals of Statistics, 25(2), 553–576.
Damiani, F., Maggio, A., Micela, G., & Sciortino, S. (1997). A method based on wavelet transforms for source detection in photon-counting detector images. I. Theory and general properties. The Astrophysical Journal, 483(1), 350–369.
Evett, I. W., & Spiehler, E. J. (1987). Rule induction in forensic science. Tech. rep., Central Research Establishment, Home Office Forensic Science Service.
Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.
Freeman, P. E., Kashyap, V., Rosner, R., & Lamb, D. Q. (2002). A wavelet-based algorithm for the spatial analysis of poisson data. The Astrophysical Journal Supplement Series, 138(1), 185–218.
Friedenberg, D. A., & Genovese, C. R. (2009). Straight to the Source: Detecting Aggregate Objects in Astronomical Images with Proper Error Control. ArXiv e-prints.
Frigo, M., & Johnson, S. G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231.
Genovese, C. R. (2000). A bayesian time-course model for functional magnetic resonance imaging data. Journal of the American Statistical Association, 95(451), 691–703.
Giacconi, R., Zirm, A., Wang, J., Rosati, P., Nonino, M., Tozzi, P., Gilli, R., Mainieri, V., Hasinger, G., Kewley, L., Bergeron, J., Borgani, S., Gilmozzi, R., Grogin, N., Koekemoer, A., Schreier, E., Zheng, W., & Norman, C. (2002). Chandra deep field south: The 1 ms catalog. The Astrophysical Journal Supplement Series, 139(2), 369–410.
Gonzalez-Nuevo, J., Argueso, F., Lopez-Caniego, M., Toffolatti, L., Sanz, J. L., Vielva, P., & Herranz, D. (2006). The mexican hat wavelet family. Application to point source detection in cmb maps. MNRAS, 369, 1603.

Guglielmetti, F., Fischer, R., & Dose, V. (2009). Background-source separation in astronomical images with Bayesian probability theory - I. The method. MNRAS, 396, 165–190.
Heller, R., Stanley, D., Yekutieli, D., Rubin, N., & Benjamini, Y. (2006). Cluster-based analysis of fmri data. Neuroimage, 33(2), 599–608.
Herschel, W. (1786). Catalogue of One Thousand New Nebulae and Clusters of Stars. By William Herschel, LL.D. F. R. S. Royal Society of London Philosophical Transactions Series I, 76, 457–499.
Hobson, M. P., & McLachlan, C. (2003). A Bayesian approach to discrete object detection in astronomical data sets. MNRAS, 338, 765–784.
Hopkins, A. M., Miller, C. J., Connolly, A. J., Genovese, C., Nichol, R. C., & Wasserman, L. (2002). A New Source Detection Algorithm Using the False-Discovery Rate. AJ, 123, 1086–1094.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.
Kaufman, L., & Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley Interscience.
Kosowsky, A. (2003). The atacama cosmology telescope. New Astronomy Reviews, 47, 939.
Lafon, S., & Lee, A. B. (2006). Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Trans. Pattern Anal. Mach. Intell., 28(9), 1393–1403.
Lindquist, M. (2008). The Statistical Analysis of fMRI Data. Statistical Science, 23(4), 439–464.
Luo, B., Bauer, F. E., Brandt, W. N., Alexander, D. M., Lehmer, B. D., Schneider, D. P., Brusa, M., Comastri, A., Fabian, A. C., Finoguenov, A., Gilli, R., Hasinger, G., Hornschemeier, A. E., Koekemoer, A., Mainieri, V., Paolillo, M., Rosati, P., Shemmer, O., Silverman, J. D., Smail, I., Steffen, A. T., & Vignali, C. (2008). The chandra deep field-south survey: 2 ms source catalogs. The Astrophysical Journal Supplement Series, 179(1), 19–36.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
Markevitch, M., Gonzalez, A. H., Clowe, D., Vikhlinin, A., Forman, W., Jones, C., Murray, S., & Tucker, W. (2004). Direct Constraints on the Dark Matter Self-Interaction Cross Section from the Merging Galaxy Cluster 1E 0657-56. ApJ, 606, 819–824.
Meinshausen, N., & Rice, J. (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist., 34, 373–393.
Melin, J.-B., Bartlett, J. G., & Delabrouille, J. (2006). Catalog extraction in SZ cluster surveys: a matched filter approach. A&A, 459, 341–352.
Nadler, B., Lafon, S., Coifman, R. R., & Kevrekidis, I. G. (2005). Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Advances in Neural Information Processing Systems 18, (pp. 955–962). MIT Press.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, (pp. 849–856). MIT Press.
Perone Pacifico, M., Genovese, C., Verdinelli, I., & Wasserman, L. (2004). False discovery control for random fields. Journal of the American Statistical Association, 99(468), 1002–1014.
Piterbarg, V. (1996). Asymptotic Methods in the Theory of Gaussian Processes and Fields. American Mathematical Society.
Raftery, A. E., & Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473), 168–178.
Richards, J. A., & Jia, X. (1999). Remote Sensing Digital Image Analysis: An Introduction. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Richards, J. W. (2009). diffusionMap: Diffusion map. R package version 1.0-0. URL http://CRAN.R-project.org/package=diffusionMap

Richards, J. W., Freeman, P. E., Lee, A. B., & Schafer, C. M. (2009a). Accurate parameter estimation for star formation history in galaxies using SDSS spectra. MNRAS, 399, 1044–1057.
Richards, J. W., Freeman, P. E., Lee, A. B., & Schafer, C. M. (2009b). Exploiting Low-Dimensional Structure in Astronomical Spectra. ApJ, 691, 32–42.
Ross, S. M. (2006). Introduction to Probability Models. Academic Press.
Savage, R. S., & Oliver, S. (2007). Bayesian Methods of Astronomical Source Extraction. ApJ, 661, 1339–1346.
Sehgal, N., Trac, H., Huffenberger, K., & Bode, P. (2007). Microwave Sky Simulations and Projections for Galaxy Cluster Detection with the Atacama Cosmology Telescope. The Astrophysical Journal, 664, 149–161.
Shim, H., Im, M., Pak, S., Choi, P., Fadda, D., Helou, G., & Storrie-Lombardi, L. (2006). Deep u*- and g-band imaging of the spitzer space telescope first look survey field: Observations and source catalogs. The Astrophysical Journal Supplement Series, 164(2), 435–449.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 479–498.
Strong, A. W. (2003). Maximum entropy imaging with integral/spi data. A&A, 411(1), L127–L129.
Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20, 25–47.
Sun, W., & Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479), 901–912.
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational & Graphical Statistics, 14(3), 511–528.
Tyson, J. A., & the LSST Collaboration (2002). Large synoptic survey telescope: Overview. Proc. SPIE Int. Soc. Opt. Eng., 4836, 10.

Ultsch, A. (2005). Clustering with som: U*c. Proc. Workshop on Self-Organizing Maps, Paris, France, (pp. 75–82).
Vale, C., & White, M. (2006). Finding clusters in sz surveys. New Astron., 11, 207.
Valtchanov, I., Pierre, M., & Gastaud, R. (2001). Comparison of source detection procedures for XMM-Newton images. A&A, 370, 689–706.
Vikhlinin, A., Forman, W., Jones, C., & Murray, S. (1995). Matched Filter Source Detection Applied to the ROSAT PSPC and the Determination of the Number-Flux Relation. ApJ, 451, 542.
Virani, S. N., Treister, E., Urry, C. M., & Gawiser, E. (2006). The extended chandra deep field-south survey: X-ray point-source catalog. The Astronomical Journal, 131(5), 2373.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.
Weisskopf, M., Tananbaum, H., van Speybroeck, L., & O'Dell, S. (2000). Chandra X-Ray Observatory: Overview, X-Ray Optics, Instruments, and Missions. Proc. SPIE, 4012.
Worsley, K., Liao, J., Aston, C., Petre, V., Duncan, G., Morales, F., & Evans, A. (2002). A general statistical analysis for fmri data. NeuroImage, 15, 1–15.
Worsley, K., Marrett, S., Vandal, A., Friston, K., & Evans, A. (1996). A unified statistical approach for determining significant statistical signals in images of cerebral activation. Human Brain Mapping, 4, 58–73.
York, D. G., Adelman, J., Anderson, J. E., Jr., Anderson, S. F., Annis, J., Bahcall, N. A., Bakken, J. A., Barkhouser, R., Bastian, S., Berman, E., Boroski, W. N., Bracker, S., Briegel, C., Briggs, J. W., Brinkmann, J., Brunner, R., Burles, S., Carey, L., Carr, M. A., Castander, F. J., Chen, B., Colestock, P. L., Connolly, A. J., Crocker, J. H., Csabai, I., Czarapata, P. C., Davis, J. E., Doi, M., Dombeck, T., Eisenstein, D., Ellman, N., Elms, B. R., Evans, M. L., Fan, X., Federwitz, G. R., Fiscelli, L., Friedman, S., Frieman, J. A., Fukugita, M., Gillespie, B., Gunn, J. E., Gurbani, V. K., de Haas, E., Haldeman, M., Harris, F. H., Hayes, J., Heckman, T. M., Hennessy, G. S., Hindsley, R. B., Holm, S., Holmgren, D. J., hao Huang, C., Hull, C., Husby, D., Ichikawa, S.-I., Ichikawa, T., Ivezic, Z., Kent, S., Kim, R. S. J., Kinney, E., Klaene, M., Kleinman, A. N., Kleinman, S., Knapp, G. R., Korienek, J., Kron, R. G., Kunszt, P. Z., Lamb, D. Q., Lee, B., Leger, R. F., Limmongkol, S., Lindenmeyer, C., Long, D. C., Loomis, C., Loveday, J., Lucinio, R., Lupton, R. H., MacKinnon, B., Mannery, E. J., Mantsch, P. M., Margon, B., McGehee, P., McKay, T. A., Meiksin, A., Merelli, A., Monet, D. G., Munn, J. A., Narayanan, V. K., Nash, T., Neilsen, E., Neswold, R., Newberg, H. J., Nichol, R. C., Nicinski, T., Nonino, M., Okada, N., Okamura, S., Ostriker, J. P., Owen, R., Pauls, A. G., Peoples, J., Peterson, R. L., Petravick, D., Pier, J. R., Pope, A., Pordes, R., Prosapio, A., Rechenmacher, R., Quinn, T. R., Richards, G. T., Richmond, M. W., Rivetta, C. H., Rockosi, C. M., Ruthmansdorfer, K., Sandford, D., Schlegel, D. J., Schneider, D. P., Sekiguchi, M., Sergey, G., Shimasaku, K., Siegmund, W. A., Smee, S., Smith, J. A., Snedden, S., Stone, R., Stoughton, C., Strauss, M. A., Stubbs, C., SubbaRao, M., Szalay, A. S., Szapudi, I., Szokoly, G. P., Thakar, A. R., Tremonti, C., Tucker, D. L., Uomoto, A., Berk, D. V., Vogeley, M. S., Waddell, P., i Wang, S., Watanabe, M., Weinberg, D. H., Yanny, B., & Yasuda, N. (2000). The sloan digital sky survey: Technical summary. The Astronomical Journal, 120(3), 1579–1587.
Zelnik-Manor, L., & Perona, P. (2005). Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.) Advances in Neural Information Processing Systems 17, (pp. 1601–1608). Cambridge, MA: MIT Press.