Original Article
Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions
Wade T. Rogers,1,2,3* Allan R. Moser,1,2 Herbert A. Holyst,1,3 Andrew Bantly,3 Emile R. Mohler III,4 George Scangas,3 Jonni S. Moore3
1 Cira Discovery Sciences, Inc., Philadelphia, Pennsylvania
2 Perinatology Research Branch, Intramural Division, NICHD/NIH/DHHS, Hutzel Women’s Hospital, Detroit, Michigan
3 Department of Pathology and Laboratory Medicine and the Flow Cytometry and Cell Sorting Resource Laboratory, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
4 Vascular Medicine Section, Cardiovascular Division, Department of Medicine, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania
Received 25 September 2007; Revision Received 5 December 2007; Accepted 16 January 2008
Grant sponsor: National Institute of Child Health and Human Development, NIH, DHHS.
*Correspondence to: Wade T. Rogers, University of Pennsylvania, 207 John Morgan Bldg., Philadelphia, PA 19104-6082. Email: [email protected]
Published online 27 March 2008 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cyto.a.20545
© 2008 International Society for Advancement of Cytometry
Cytometry Part A 73A: 430–441, 2008
Abstract
Recent technological advances in flow cytometry instrumentation provide the basis for high-dimensionality and high-throughput biological experimentation in a heterogeneous cellular context. Concomitant advances in scalable computational algorithms are necessary to better utilize the information contained in these high-complexity experiments. The development of such tools has the potential to expand the utility of flow cytometric analysis from a predominantly hypothesis-driven mode to one of discovery, or hypothesis-generating research. A new method of analysis of flow cytometric data called Cytometric Fingerprinting (CF) has been developed. CF captures the set of multivariate probability distribution functions corresponding to list-mode data and then “flattens” them into a computationally efficient fingerprint representation that facilitates quantitative comparisons of samples. Experimental and synthetic data were generated to act as reference sets for evaluating CF. Without the introduction of prior knowledge, CF was able to “discover” the location and concentration of spiked cells in ungated analyses over a concentration range covering four orders of magnitude, to a lower limit on the order of 10 spiked events in a background of 100,000 events. We describe a new method for quantitative analysis of list-mode cytometric data. CF includes a novel algorithm for space subdivision that improves estimation of the probability density function by dividing space into nonrectangular polytopes. Additionally, it renders a multidimensional distribution in the form of a one-dimensional multiresolution hierarchical fingerprint that creates a computationally efficient representation of high-dimensionality distribution functions. CF supports both the generation and testing of hypotheses, eliminates sources of operator bias, and provides an increased level of automation of data analysis.
Key terms flow cytometry; cytomics; bioinformatics; computational biology; machine learning; empirical modeling
In the past decade, flow cytometry, along with other high information content technologies, has experienced exponential growth. The power of flow cytometry to allow the simultaneous measurement of multiple cellular components of individual cells in a heterogeneous environment makes it an ideal technology for high-dimensional biological studies. Along with other “-omics” technologies (such as genomics and proteomics), cytomics, facilitated by high-dimensional flow cytometry, has shifted scientific investigation from a purely hypothesis-driven mode to a more discovery-oriented, or hypothesis-generating, mode. With such approaches, complex cellular phenotypes and interactions can be elucidated and can contribute to the understanding of such important clinical questions as successful vaccination strategies (1–3) and the nature and role of stem cells in health and disease (4–6). The availability of advanced instrumentation, coupled with expansion in the selection of probes, has extended the capabilities of multivariate analyses of cells and cell systems to the point that conventional data-analytical paradigms have become the rate-limiting step in the overall process of collecting, analyzing, and interpreting complex data sets (7–10). Twenty-first century flow cytometric technology will thus benefit from the development of new approaches to the analysis of data, especially those that facilitate the application of machine learning and data mining methodologies.

Conventional flow cytometry typically measures events that are variable in only a few dimensions. In the best cases the data are quantal; that is, cells express particular antigens according to tightly regulated processes, causing the number of such molecules per cell to fall within a narrow range. Thus, a two-dimensional projection of the data for two cell-surface markers characterizing a population of cells can be divided into four quadrants into which individual clusters are segregated. Variation in the choice of thresholds to separate positive events from negative events on each axis has a negligible influence on estimates of the density of events in each quadrant. More complex multiparameter distributions are usually simplified into a series of two-dimensional distributions by sequential gating (2) or, in some cases, by more sophisticated principal component analysis (PCA) (8), which represents the high-dimensionality data in a lower-dimensionality projection that retains most of the variance in the dataset (11). However, these dimensionality-reducing approaches possess a number of drawbacks that limit their suitability for the analysis of complex populations. First, serial pairwise analysis decouples parameter correlations that may be important in defining complex populations. Additionally, analytical bias may be introduced through multiple subjective operator interventions, which is of particular relevance in the analysis of complex, nonquantal distributions.
Finally, sequential gating involves the imposition of selection criteria on the data that do not allow for new interpretations of poorly characterized populations.

Analysis via gating remains the most effective method for filtering data using well-established criteria (e.g., removing data associated with dead cells, or counting populations in highly optimized and quantal assays). In some cases the application of expert judgment in the visual design of gating strategies may be able to isolate events of biological interest even in the presence of confounding experimental (or biological) variation that would be difficult to account for automatically. Nevertheless, there is a strong need for tools to transform and represent multiparameter flow cytometric data in a form efficiently amenable to methods of machine learning and data mining.

We describe here a new algorithm for the analysis of high-dimensional cytometric data, termed Cytometric Fingerprinting (CF). CF extends the approach of Probability Binning (PB) (7) and its companion method Frequency Difference Gating (12). The objective of CF is to represent the information contained in cytometric list-mode files in a form that enables quantitative comparison among samples. These fingerprints are capable of capturing and encoding the full multivariate correlations of complex, high-dimensional cytometric data. This representation is particularly useful when cell populations are not clearly delineated by optimized assays and the distribution of events in the multiparameter space is not bimodal.
MATERIALS AND METHODS
Cell Staining and Flow Cytometric Analysis
Peripheral blood mononuclear cells (PBMCs) were isolated from 60 ml of fresh blood from seven healthy donors as described in Ref. 6, collected under the guidelines of the Institutional Review Board of the University of Pennsylvania School of Medicine. Cells were counted and split into Pool no. 1, Pool no. 2, and a compensation control group. The compensation control set contained 10^7 cells, and the Pool no. 1 and Pool no. 2 sets each contained 2 × 10^7 cells. The two pools of cells were each stained with a different antibody combination to create a known variable subset of events embedded in a background distribution. Antibody labeling was performed according to standard procedures. Briefly, all cells were blocked with murine IgG (10 μg per 10^7 cells) (Sigma, St. Louis, MO) for 10 min at room temperature, and then stained either with 60 μl anti-human CD45 FITC (BD Biosciences Pharmingen, San Jose, CA), 15 μl anti-human CD4 PE (Invitrogen/Caltag Laboratories, Carlsbad, CA), and 15 μl anti-human CD3 APC (Invitrogen/Caltag Laboratories, Carlsbad, CA) for Pool no. 1, or with 60 μl anti-human CD45 FITC, 15 μl anti-human CD4 PE, and 15 μl anti-human CD64 APC (Invitrogen/Caltag Laboratories, Carlsbad, CA) for Pool no. 2. Three compensation control tubes were labeled with single antibodies as in Pool no. 1. A fourth, unstained control tube had 100 μl of staining buffer (DPBS + 1.0% BSA) added to it. After labeling, cells were incubated for 30 min at 4°C in the dark, washed twice with 2 ml of staining buffer, and finally resuspended in 2 ml of staining buffer. A series of collection tubes was made by aliquoting equal numbers (1 × 10^6) of Pool no. 2 cells at a final volume of 1 ml.

Cell Sorting
A BD FACS Aria was used to sort Pool no. 1 or Pool no. 2 cells into the tubes containing 1 million Pool no. 2 cells.
Sorted population gating was based on light scatter for lymphocytes and monocytes [P1 gate], singlets [P2 gate], CD45(+) FITC-A vs. SSC-A [P3 gate], and CD4(+) PE-A vs. CD3(+) APC-A [P4 gate], after compensation, as shown in Figure 1. A series of increasing numbers (10, 32, 100, 316, 1,000, 3,162, 10,000, 31,623, or 100,000 cells) of Pool no. 1 cells was sorted into the collection tubes containing 1 × 10^6 Pool no. 2 cells. The resulting composite distributions are shown in the lower part of Figure 1. Three tubes in which Pool no. 2 cells (10, 100, or 1,000 cells) were sorted into Pool no. 2 collection tubes were used as controls.

Figure 1. Gating strategy and event distribution in spiking experiment. Upper panel: Gating strategy for spiking Pool no. 1 cells via cell sorting into aliquots of Pool no. 2 cells. Lower panel: Distributions of spiked samples at nine different spike concentrations (corresponding to the concentrations in Table 1) are shown. The boxes show the location of the spiked events.

Flow Cytometry Data Acquisition
Data were acquired on a BD FACS Canto immediately after sorting for each experiment. PMT voltages were determined using the compensation control tubes. Once voltages were established during the first experiment, Sphero Rainbow Calibration Particles (eight peaks) (Spherotech, Lake Forest, IL) were used to standardize fluorescence measurements across experiments. To do this, 10^5 beads were collected, a tight gate was set around the three brightest peaks for the FITC, PE, and APC channels, and the median intensity for each peak was recorded. PMT voltages were then adjusted if necessary to produce the same median intensities for each parameter as in the first experiment. For cellular events, the threshold was set at FSC 30,000. One hundred thousand ungated events were collected for each sample. For the Pool no. 2 cells sorted into Pool no. 2, five tubes of each sample were collected to be used as controls. Between tubes, buffer alone was run until no events were detectable. After all tubes were collected, the Spherotech 8-peak beads were run a second time to check for changes in the median intensities of the three bright peak gates for each channel. In the experiments that followed, the PMT voltages for the samples were determined by using the predrawn gates used to monitor possible shifts in the median intensity in the first experiment.
Flow Cytometry Data Analysis
FCS3.0 files were analyzed using FlowJo Software version 6.4.7 (TreeStar Software, Ashland, OR). Population gating, with gates corresponding to those described above for the sort, was used to establish reference numbers of Pool no. 1 cells in each tube. These numbers were used to determine the accuracy of event counts obtained with the CF method described below. Cytometric fingerprints were generated and analyzed using custom-written software in C, Perl, MATLAB version 6.5.0 (The MathWorks, Natick, MA), and R version 2.5.1.
Cytometric Fingerprints
CF aims to capture the set of multivariate probability distribution functions corresponding to a set of list-mode files and then “flatten” them into a computationally efficient fingerprint representation that facilitates quantitative comparisons of samples. The CF algorithm starts by specifying the number of hierarchical levels $L$ for the representation. The total number of bins $n_B$ into which the space is to be divided is related to the number of hierarchical levels by the expression $n_B = 2^L - 1$. (Note that the undivided space is considered to be the first level in the hierarchy. Thus, the total number of bins is an odd number.) The next step is to determine boundaries that recursively subdivide the multidimensional space into regions of equal probability. The subdivision of space is done in such a way that at each division the parent bin is divided by a hyperplane perpendicular to the direction of maximum variance of the data within the bin. Thus, the variance of data within a bin is maximally reduced at each subdivision, subject to the constraint that the data are divided into approximately equal portions. PCA is utilized to find the direction of maximum variance (11,13–17). The procedure, termed Oblique Recursive Subdivision, is outlined in Figure 2. It is noteworthy that while the bin boundaries consist of the intersection of hyperplanes in a $p$-dimensional space, these boundaries need not be explicitly stored. All of the information necessary to bin new data is contained in the hierarchy of rotation matrices $R_n$ and median split values described in Figure 2. Once bin boundaries have been determined from a data set (denoted $D_{\mathrm{template}}$), the procedure for binning another data set, $D_{\mathrm{new}}$, utilizing this bin description is straightforward. The procedure is identical to that described earlier except that the rotation matrices and median values from $D_{\mathrm{template}}$ are used rather than being recalculated from the new data set.

Figure 2. Schematic outline for oblique recursive subdivision. The method starts with a list-mode file for the template dataset D. After initialization, the algorithm recursively processes events in parent bins at a given level of resolution, forming children bins at the next higher level of resolution. The algorithm terminates when the specified number of bins has been generated.

The binning method utilized by oblique recursive subdivision is similar to that employed by PB (7). Both recursively subdivide space into bins containing equal numbers of events. For multivariate distributions, both use variance as the splitting criterion. The primary difference is that PB is constrained to split along the coordinate axes, whereas oblique recursive subdivision splits the data in the actual direction of maximum variance. As stated in Ref. 7, “By choosing the parameter with the largest variance, the algorithm is weighted towards assigning distinct clusters of events into distinct bins (or sets of bins).” Thus, oblique recursive subdivision may be regarded as a generalization of PB that more accurately achieves this goal. A simple pictorial example showing the difference between the binning achieved by PB and that of oblique recursive subdivision is shown in Figure 3. This example shows two distributions of events in a two-dimensional space. Splitting in the direction of maximum variance, as is done in oblique recursive subdivision, places the events from these two distributions in separate bins starting from the first level of recursion. Constraining the split direction to a coordinate axis, as is done in PB, creates bins that mix events from both distributions, even in the higher resolution bins. Of course, there is no way of knowing whether such distributions exist in real data. Additionally, if enough data exist to create sufficiently high-resolution bins, the event distributions will eventually be separated by either binning method. However, distributions such as this could, in principle, exist at any resolution within data.

As discussed in Ref. 7, there are potential disadvantages to splitting in the direction of maximum variance. Distributions that are spread out in one dimension will be split through the middle, making the method sensitive to subtle shifts in the position of the bulk of the event population. Note that while this may present a problem for rare event detection, it can also be exploited as a means of quality control (18). Additionally, care must be taken that all parameters are transformed such that their variances have approximately equal scale. Importantly, log-normally distributed values (as is typical of fluorescence parameters) should be log-transformed prior to binning.
All of the data presented in the “Results” section below were treated in this way.

Generation of Cytometric Fingerprints
A simple one-dimensional representation of the density variations relative to a template may be obtained by recording the densities for the bins in the form of a list. We call this representation a “fingerprint” because it distinguishes individual samples (referred to as “instances”) represented by list-mode data. Figure 4 schematically illustrates this procedure. Fingerprints are derived as follows:
i. For a set of $M$ instances $\{S_1, S_2, \ldots, S_M\}$, aggregate the events from all of the instances to form a composite denoted $S$.
ii. Find the bins for $S$.
iii. Bin the data for each instance $S_i$ ($i = 1, 2, \ldots, M$), using the bins from Step ii.
iv. Convert the event counts in each bin into an event fraction by dividing each count by the total number of events in the data set.

The lists of binned event densities for the set of instances, $S_1, S_2, \ldots, S_M$, form a set of fingerprints for these data relative to the probability density estimated from the composite data set, and may be used, for example, to detect and quantify deviations from a norm (18).

Figure 3. Comparison of rectangular and oblique binning. The figure illustrates the differences between the binning method utilized by probability binning and oblique recursive subdivision. One thousand two-dimensional events were generated such that they fall into the two elliptical clusters shown in red and blue. The figure on the left shows the result for probability binning. The figure on the right shows the result for oblique recursive subdivision. The thick brown line indicates the location of the first subdivision. Probability binning splits the events along the Parameter 2 axis, creating two bins having mixed blue-cluster and red-cluster events, whereas oblique recursive subdivision creates two bins which are each pure in red and blue events. Subsequent bin divisions are indicated by successively thinner lines in green (second level recursion), purple (third level), light brown (fourth level), and black (fifth level). Even at the fifth level of recursion, two bins remain “mixed” with probability binning, although at successively higher levels of resolution, few events are in the mixed bins. For oblique recursive subdivision, because clusters are separated at the first level of recursion, bins formed at every level of recursion remain pure.

Figure 4. Schematic representation of fingerprint formation. This diagram shows two instances binned with respect to template bins found using oblique recursive subdivision. In this schematic example, there are a total of 63 bins. The thickest brown diagonal line separates the data into two level 1 bins; the thick blue lines separate each of these bins into the four level 2 bins; and so on. The fingerprints to the right of the dot plots display raw event counts (or P values obtained from Poisson statistics) for each bin, with the color intensity encoding the amplitude of the feature. Fingerprints are organized in resolution hierarchies, with low resolution features to the left progressing to successively higher resolution features to the right. There are $2^L$ features within a resolution hierarchy, where $L$ is the level of resolution. Thus, the rightmost 32 features indicate the fingerprint values at the highest level of resolution. The dot plot for the “disease” instance has a well-defined cluster of cells in the lower right-hand bin which is not present in the normal instance. The arrows indicate the effect of this cluster in the disease instance fingerprint at multiple levels of resolution.

Another variation of fingerprinting is particularly useful for classification. The goal is to emphasize differences between samples belonging to different classes. For classification problems, one typically has a set of training instances for which the class identity is known and a set of test or validation instances for which the class identity is unknown. The training data may be aggregated to construct class-specific templates. Fingerprints for individual instances, from both the training set and the test or validation sets, can be computed utilizing bins derived from the templates. Fingerprints from the training instances may then be used to construct supervised classifiers with which to classify fingerprints for test or validation instances.
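Oblique recursive subdivision and fingerprint steps i to iv described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C/Perl/MATLAB/R software: the function names `fit_template_bins` and `fingerprint` are hypothetical, only the first principal axis is stored for each parent bin (rather than a full rotation matrix), bins are numbered heap-style so that the fingerprint runs from low to high resolution, the data are assumed to be already log-transformed, and the template is assumed to contain enough events to populate every bin.

```python
import numpy as np

def fit_template_bins(template, n_levels):
    """Oblique recursive subdivision of a template (events x parameters).

    Returns {node: (direction, median)} for every parent bin. Heap numbering
    (an illustrative choice): node 1 is the undivided space and node n has
    children 2n and 2n + 1, giving 2**n_levels - 1 bins in total.
    """
    splits = {}

    def subdivide(node, data, level):
        if level == n_levels - 1:          # deepest level: no further split
            return
        centered = data - data.mean(axis=0)
        # First right-singular vector = direction of maximum variance (PCA).
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        direction = vt[0]
        projection = data @ direction
        median = np.median(projection)     # equal-probability split point
        splits[node] = (direction, median)
        lower = projection <= median
        subdivide(2 * node, data[lower], level + 1)
        subdivide(2 * node + 1, data[~lower], level + 1)

    subdivide(1, np.asarray(template, dtype=float), 0)
    return splits

def fingerprint(instance, splits, n_levels):
    """Bin one instance with the stored template splits; return event fractions."""
    events = np.asarray(instance, dtype=float)
    n_bins = 2 ** n_levels - 1
    counts = np.zeros(n_bins)
    node = np.ones(len(events), dtype=int)     # every event starts in bin 1
    for level in range(n_levels):
        counts += np.bincount(node, minlength=n_bins + 1)[1:]
        if level == n_levels - 1:
            break
        child = np.empty_like(node)
        for n in np.unique(node):
            direction, median = splits[n]      # reuse template values
            members = node == n
            lower = events[members] @ direction <= median
            child[members] = np.where(lower, 2 * n, 2 * n + 1)
        node = child
    return counts / len(events)                # step iv: event fractions

# Usage with made-up two-dimensional data (not from the experiments above).
rng = np.random.default_rng(0)
shear = np.array([[2.0, 1.0], [0.0, 0.5]])
instances = [rng.normal(size=(500, 2)) @ shear for _ in range(2)]
template = np.vstack(instances)                        # step i
splits = fit_template_bins(template, n_levels=4)       # step ii
prints = [fingerprint(s, splits, 4) for s in instances]  # steps iii and iv
```

Because each instance is binned with the template's stored directions and medians, the bin boundaries never need to be represented explicitly, which mirrors the remark above about storing only the split hierarchy.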
RESULTS
Spiked Sample Experiment and Analysis
A series of “spiking” experiments was carried out to demonstrate the use of cytometric fingerprints for quantitative analysis of cell populations. The aim of these experiments was to simulate a phenotypically unique population of cells present at varying concentrations depending on some condition or treatment. This was done by creating a series of concentrations of minority cell populations (Pool no. 1) spiked into a majority cell population (Pool no. 2). Fifteen list-mode files were obtained from tubes with nominal spiked cell concentrations increasing in half-decade increments from 0.001% to 10%. Additionally, 14 list-mode files were acquired for control samples having only Pool no. 2 cells. The actual concentrations of spiked cells varied from the nominal values because of uncertainties in the sorting process and in the number of reservoir Pool no. 2 cells into which the Pool no. 1 cells were spiked. For the population gating analysis, a set of sequential gates corresponding to that used to sort Pool no. 1 cells into the Pool no. 2 reservoir was used to determine the reference number of spike cells in each sample. By contrast, the CF method did not have access to knowledge of the location and distribution of the spiked events. Rather, the goal was to automatically determine the location and concentration of spiked cells for each concentration experiment independently by direct comparison with the Pool no. 2 reference set. To do this, the only assumption made by CF was that some events may be present in one distribution that were absent in the other. Thus, the objective of the analysis in this section was to demonstrate the ability of CF to generate hypotheses (i.e., automatically determine unique cell population locations and concentrations) and to determine the concentration limits for which such hypothesis generation is possible.
As discussed previously, cytometric fingerprints are one-dimensional lists of event counts in bins formed with respect to a template instance. Event counts for list-mode files binned with respect to this template show density variations for that sample with respect to the template. In general, to compare two sets of samples, one could choose to aggregate all instances to form a template; choose one of the two sets to form the template; or form templates from each set and create two sets of fingerprints for each instance. For the spiking experiments, we chose to form template bins based on each spiked experimental sample and then create fingerprints for the control samples with respect to each of these templates. Qualitatively, what one expects to see with this approach is that a cluster of Pool no. 1 cells in the spiked sample will cause some number of bins to be localized in the region of parameter space occupied by this cluster. The number of bins formed in the region of Pool no. 1 cells depends on the concentration and degree of localization of the spiked cells. Binning unspiked samples with respect to this template will result in a fingerprint with low event counts in the bins associated with the spiked cell cluster. We note that in a setting where it is not known in advance in which group of samples a cluster is likely to occur, samples from all groups can independently be used to generate templates and each sample can then be binned against each template, a method we call differential fingerprinting. For the sake of simplicity we do not illustrate this here; however, the results would be equivalent.

We next describe the quantitative estimation of the number of spiked cells. The total number of events in a spiked sample is the sum of the spiked Pool no. 1 events and background Pool no. 2 events:

$$N = N_1 + N_2. \tag{1}$$
This superposition of event counts holds for every subregion of cytometric parameter space. In particular, if the spiked sample is used as the template and bins are formed using oblique recursive subdivision, for each bin $i$, the total number of events $T_i$ is

$$T_i = X_i + Y_i, \tag{2}$$
where $X_i$ and $Y_i$ are, respectively, the numbers of Pool no. 1 and Pool no. 2 events in bin $i$. By design, the number of events in a bin (at a given resolution) for the template instance is $T_i = N/n_B$, where $n_B$ is the number of bins at the given resolution. From Eq. (2), the number of Pool no. 1 events is

$$X_i = \frac{N}{n_B} - Y_i. \tag{3}$$
The total number of spiked events in a sample can be found by summing over the set of bins $S$ containing spiked cells:

$$N_1 = \sum_{i \in S} X_i = \sum_{i \in S} \left( \frac{N}{n_B} - Y_i \right). \tag{4}$$

The number of Pool no. 2 cells in each bin can be estimated from the cytometric fingerprints of the control samples because these contain only Pool no. 2 events. Aggregating the control samples and binning relative to the spiked sample template, the fraction of Pool no. 2 cells in each bin is given by

$$f_i = \frac{k_i}{mN}, \tag{5}$$
where $k_i$ is the event count value for bin $i$ obtained from the fingerprint of the aggregated control samples. The denominator in this expression is the total number of events in the aggregated control sample, which is $m$ (the number of aggregated Pool no. 2 tubes) times the number of events, $N$, in each tube. Using Eqs. (1) and (5), the number of Pool no. 2 events in bin $i$ is

$$Y_i = f_i N_2 = f_i (N - N_1) = \frac{k_i}{mN} (N - N_1). \tag{6}$$
Combining Eqs. (4) and (6), the fraction of spiked events in an experimental sample is

$$\frac{N_1}{N} = \frac{mN|S|/n_B - \sum_{i \in S} k_i}{mN - \sum_{i \in S} k_i}, \tag{7}$$

where $|S|$ is the number of bins containing spiked cells.

The remaining task is to determine the set of bins $S$ encompassing the region of localization of spiked cells. Because the template bins were created using spiked samples, if there is a sufficient concentration of Pool no. 1 cells, one expects the Pool no. 2 control sample fingerprints to have abnormally low event counts in these bins. For example, in the experiment with 0.1% spike concentration, there are 100 Pool no. 1 cells out of the 100,000 events acquired from the tube. At a resolution of 8,192 bins, there will be on average 12.2 events per bin. Thus, if the Pool no. 1 cells are well localized, one would expect on the order of eight template bins (100/12.2 ≈ 8) to correspond to this region in parameter space, and the control sample fingerprints should have low event counts in these bins. These low event count bins show up most clearly if the control sample fingerprint values are sorted in order of ascending bin event count. For spiked samples with a sufficient spike population, the sorted control sample fingerprint will have several zero values followed by a rapid rise corresponding to bins associated with the background of Pool no. 2 events. The position of this change in slope is most easily determined using the derivative of the sorted fingerprint values. The first maximum in the derivative of the sorted fingerprint signal indicates the point where bins dominated by spiked events transition to bins dominated by background events.

The selection of bins by the method outlined above is somewhat conservative in that we pick the set of bins clearly dominated by Pool no. 1 events, excluding some bins that contain a mixture of a small number of spiked events and a larger number of background events. This results in an underestimation of the spiked event concentration.
However, this underestimation can be corrected by taking advantage of the multiresolution nature of CF. At higher resolutions a larger number of low event-count bins pack more tightly around the cluster of spiked events and thus are able to delineate more clearly between spiked and background events. Lower resolution bins contain more events but are broader and may overlap the spiked and background regions. Our bin selection strategy is to use the highest resolution bins to determine the cutoff threshold for the number of Pool no. 2 events allowed in a bin and then apply this threshold to the lower resolutions. This is analogous to using spheres of successively smaller radii to estimate the volume of a container. Utilizing the threshold determined from the highest resolution corresponds to considering only spheres (at any resolution) that are contained within the volume to the same degree as the smallest spheres.

Each resolution of binning provides an estimate of the spiked cell concentration. The form of a function useful for estimating the behavior of concentration as a function of resolution can be obtained by geometrical arguments. As the binning resolution increases, the fraction of bins overlapping spiked and background regions decreases. Consider the fraction of bins that lie on the boundary of the spiked region relative to the number of bins interior to the region. Letting $h$ represent the characteristic linear size (scale) of the bins, the number of bins at the boundary is proportional to $S/h^{d-1}$ while the number of bins interior to the spike region is proportional to $V/h^{d}$, where $S$ and $V$ represent the surface area and volume of the region and $d$ is the dimensionality of the space. The fraction of bins on the boundary of the spike region is given by

$$f = \frac{S/h^{d-1}}{V/h^{d}} \propto h,$$

which is proportional to the scale of the bins. The scale of the bins decreases by a factor of 2 for each increase in resolution level. Thus, the fraction of bins on the boundary decreases geometrically with resolution hierarchy index and can be represented as $f \sim e^{-cL}$. Therefore, the estimated concentration level asymptotically approaches the true value and can be represented as

$$C(L) \propto (1 - af) = C_0 \left( 1 - a e^{-kL} \right). \tag{8}$$
$C_0$, $a$, and $k$ are parameters that can be fit from the series of multiresolution estimates. The parameters $a$ and $k$ are proportionality factors that depend on the geometry of the spike distribution and the dimensionality of the space, while $C_0$ may be interpreted as the asymptotic concentration as the resolution approaches infinity.

High-resolution (8,192 bins) fingerprints for the aggregate of the control Pool no. 2 cells binned relative to each spike concentration sample are shown in Figure 5. Low event count bins are clearly visible for the high spike concentration experiments. As the spike concentration decreases, the number of bins with low event density falls. The limit of detection is at the 0.01% level, corresponding to 10 cells out of 100,000. This is an expected result because, for the highest resolution, the best one could do is to form a bin around a cluster of 12 spiked events. Table 1 summarizes the results of the spiking experiments for levels of resolution corresponding to 1,024, 2,048, 4,096, and 8,192 bins. It also shows the asymptotic concentration using the method described earlier. Spike concentrations less than 0.03% were detectable at only the highest level of resolution and so could not be extrapolated. The table shows good agreement between the event counts from population gating and those obtained by CF.
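The bin-selection rule and the estimator of Eq. (7) can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: the function names are hypothetical, the "first maximum in the derivative" is taken to be the first local maximum of the discrete difference of the sorted counts (no smoothing is applied, which real data would probably need), and the control counts in the usage lines are made up.

```python
import numpy as np

def select_spike_bins(control_counts):
    """Indices of bins judged to be dominated by spiked events.

    Sort the aggregated control counts, locate the first local maximum of the
    discrete derivative, and keep every bin whose count does not exceed the
    value just before that jump (a simplifying assumption).
    """
    counts = np.asarray(control_counts, dtype=float)
    ordered = np.sort(counts)
    slope = np.diff(ordered)
    padded = np.concatenate(([0.0], slope, [0.0]))
    for j in range(len(slope)):
        left, here, right = padded[j], padded[j + 1], padded[j + 2]
        if here > 0 and here >= left and here >= right:
            return np.flatnonzero(counts <= ordered[j])
    return np.array([], dtype=int)       # no jump found: nothing selected

def spiked_fraction(control_counts, selected, n_tubes, events_per_tube):
    """Eq. (7): N1/N from the aggregated control counts k_i in the set S."""
    counts = np.asarray(control_counts, dtype=float)
    n_bins = len(counts)                 # n_B at this resolution
    m_n = n_tubes * events_per_tube      # m * N
    k_sum = counts[selected].sum()       # sum of k_i over i in S
    return (m_n * len(selected) / n_bins - k_sum) / (m_n - k_sum)

# Hypothetical example: 16 template bins, 5 control tubes of 1,000 events each.
control = np.array([0, 0, 1, 0, 410, 420, 415, 418,
                    416, 417, 419, 421, 414, 416, 417, 416])
bins = select_spike_bins(control)
fraction = spiked_fraction(control, bins, n_tubes=5, events_per_tube=1000)
```

With these made-up counts, four of the sixteen template bins are selected and the estimated spiked fraction comes out close to 0.25, as expected when a quarter of the template bins are almost empty in the controls.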
Figure 5. Cytometric fingerprints from spiking experiment. Fingerprints corresponding to two different spike concentrations are shown for illustrative purposes. The spiked samples were used as the template instances to determine bins, and the fingerprints represent the aggregate of the unspiked pure Pool no. 2 population (as described in the text). (a) Fingerprints for a nominal spike concentration of 1%. The upper graph shows the 8,192 high-resolution bins of the fingerprint. The bins shown in black represent the location corresponding to the spiked events. The lower graph shows the same data, zoomed to show better that there are many bins corresponding to the location of spiked events in which the reference instances lack events. (b) As in (a) but for 0.1% spiking concentration. Note that the number of black bins is much smaller than that at the higher 1% spike concentration.
Analysis of Synthetic Data: Detection of Rare Events
CF can be used for a variety of objectives in the analysis of flow cytometric data. Here we show that CF is able to detect and identify the presence of a small number of events against a large overlapping background distribution, and that it can do so in the presence of significant “biological” variability. This is similar to the application of PB termed Frequency Difference Gating (12). Using the R Statistical Computing Environment, we created a series of two synthetic populations consisting of 100
instances each of simulated flow cytometric data. The two populations are termed ‘‘normal’’ and ‘‘disease.’’ Each instance in each population consists of 100,000 events described in a four-dimensional space. Each event is generated from one of five equally represented 4D hyperellipsoidal distribution functions whose centroids (Ci) and standard deviations (Si) were assigned using a random number generator within predefined limits corresponding to a typical log-linear parameter range of 0–1,024 log fluorescence intensity. For the disease instances, a sixth cluster was added with n events. (An equal number of n events were removed at random from the other five clusters such that the total number of events in each data set remained unaltered.) A series of disease populations was generated, varying n from a minimum of 10 events (corresponding to 0.01% disease events) to a maximum of 1,000 events (corresponding to 1% disease events). The actual centroids for each hyperellipsoid (including the disease hyperellipsoid) for each instance were randomly varied with a standard deviation of 0.08 log10 units. The standard deviations were randomly varied for each instance with a standard deviation of 0.04 log10 units. Thus, no two instances had the same distributions, and there were marked visual differences in the distribution as judged from pairwise comparisons of the dotplots (data not shown). The locations of the hyperellipsoidal functions were such that the disease population was significantly intermingled with the normal population along all parameter axes. Thus, with no prior knowledge of the location of the disease-indicating events, at the lower disease event concentrations (0.1% and lower) it is not possible to visually detect their presence. Consequently, conventional sequential gating analysis in such a situation cannot be applied. 
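The construction of one synthetic instance can be sketched as follows. This is not the authors' R code; it is a minimal Python illustration under several assumptions that the text does not confirm: clusters are modeled as axis-aligned Gaussians, disease events replace background events evenly across clusters rather than at random, only the centroids (not the standard deviations) are jittered, and the jitter is expressed directly in channel units. The names `jitter` and `make_instance` and all numeric values in the usage lines are hypothetical.

```python
import random

def jitter(values, sd, rng):
    """Perturb each value with zero-mean Gaussian noise of standard deviation sd."""
    return [v + rng.gauss(0.0, sd) for v in values]

def make_instance(centroids, sds, n_events, n_disease=0,
                  disease_centroid=None, disease_sd=None,
                  centroid_jitter=0.0, rng=None):
    """Return (events, labels) for one simulated sample.

    Background clusters are labeled 0..k-1 and are (nearly) equally
    represented; the optional disease cluster is labeled k. Disease events
    replace background events, so the total stays at n_events.
    """
    rng = rng or random.Random()
    k = len(centroids)
    n_background = n_events - n_disease
    counts = [n_background // k + (1 if i < n_background % k else 0)
              for i in range(k)]
    clusters = list(zip(centroids, sds, counts))
    if n_disease:
        clusters.append((disease_centroid, disease_sd, n_disease))
    events, labels = [], []
    for label, (centre, sd, n) in enumerate(clusters):
        # Instance-to-instance variability: shift this cluster's centroid.
        centre = jitter(centre, centroid_jitter, rng)
        for _ in range(n):
            events.append(tuple(rng.gauss(c, s) for c, s in zip(centre, sd)))
            labels.append(label)
    return events, labels

# Hypothetical parameters on a 0-1,024 channel scale (reduced event count).
rng = random.Random(1)
centroids = [[rng.uniform(200, 800) for _ in range(4)] for _ in range(5)]
sds = [[rng.uniform(20, 60) for _ in range(4)] for _ in range(5)]
events, labels = make_instance(centroids, sds, n_events=10_000, n_disease=10,
                               disease_centroid=[500.0] * 4,
                               disease_sd=[25.0] * 4,
                               centroid_jitter=20.0, rng=rng)
print(len(events), labels.count(5))
```

Keeping the labels alongside the events is what makes it possible, in synthetic data, to say with certainty which events are true disease events.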
For a given population of 100 instances each of disease and normal, we further divided the instances randomly into equal training and testing sets. The events in the normal training instances were aggregated, and CF template bins were determined from this aggregate to a resolution of 8,192 leaf bins (16,383 bins in total). Cytometric Fingerprints were then individually computed for all 200 instances in the population with respect to this template. Fingerprints were computed from the aggregated disease training instances and a procedure similar to that described in the previous section was used to automatically select significant features. Figure 6 shows a set of dotplots for one of the disease training instances (0.1% disease events) selected at random. Disease events are shown in red. The circles show the known centroid of the disease cluster. The events that fall within the informative fingerprint bins group closely about the location where the disease events were located in this instance, demonstrating that CF indeed identifies the multiparameter locus of significance in this data set. Moreover, for the 0.1% disease case, on average about 50% of the events that fall within the informative leaf bins are actual disease events, which can be determined with certainty in synthetic data. This represents an enrichment factor of about 500×, showing that CF can be an effective means for sorting rare events. The detection task is equivalent to determining which instances are normal and which are disease by means of their
Table 1. Estimate of spiked cell concentration by CF (a)

                     Estimated % spiked cell concentration (b)
Nominal %  Gating    1,024 bins  2,048 bins  4,096 bins  8,192 bins  Asymptotic
                     (L = 10)    (L = 11)    (L = 12)    (L = 13)    (L → ∞)
0.001      0.002     –           –           –           –           –
0.003      0.003     –           –           –           –           –
0.010      0.019     –           –           –           0.012       0.012
0.032      0.041     –           –           0.024       0.036       0.036
0.100      0.170     0.097       0.145       0.145       0.157       0.157
0.316      0.530     0.390       0.488       0.488       0.512       0.513
1.000      1.950     1.757       1.806       1.830       1.866       1.981
3.162      4.720     4.491       4.539       4.588       4.610       4.697
10.00      13.80     13.379      13.525      13.622      13.695      13.854

(a) Column 1 = nominal spike concentration; Column 2 = actual spiked cell concentration determined by population gating with sequential gates corresponding to those used for sorting.
(b) Columns 3 through 6 give the results of cytometric fingerprint analysis at four binning resolutions; Column 7 shows the concentration estimated by using the four fingerprint resolutions to extrapolate to infinite resolution.
Cytometric Fingerprints alone. For this purpose, we utilized the selected features for the set of simulated instances at 0.1% disease event concentration to perform supervised classification. We chose linear discriminant analysis (LDA) (function lda in package MASS of the R statistical computing environment) because it is perhaps the simplest method of classification and its use here is intended to illustrate that the cytometric fingerprints are in fact informative with respect to phenotype.
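The classification step can be sketched in a few lines. The authors used `lda` from the R package MASS; the code below is not that function but a plain-Python stand-in for two-class linear discriminant analysis with equal priors and a pooled covariance matrix, offered only as an illustration. All function names are hypothetical, the small ridge term is an added assumption for numerical stability, and the feature vectors are random placeholders rather than real fingerprint bins.

```python
import random

def mean_vec(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_cov(groups, means, ridge=1e-9):
    """Pooled within-class covariance; ridge is an assumed stabilizer."""
    d = len(means[0])
    cov = [[0.0] * d for _ in range(d)]
    for rows, m in zip(groups, means):
        for row in rows:
            diff = [x - mi for x, mi in zip(row, m)]
            for i in range(d):
                for j in range(d):
                    cov[i][j] += diff[i] * diff[j]
    dof = sum(len(rows) for rows in groups) - len(groups)
    return [[cov[i][j] / dof + (ridge if i == j else 0.0) for j in range(d)]
            for i in range(d)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def train_lda(normal, disease):
    """Return (w, threshold): w = inverse pooled covariance times mean difference."""
    mn, md = mean_vec(normal), mean_vec(disease)
    w = solve(pooled_cov([normal, disease], [mn, md]),
              [x - y for x, y in zip(md, mn)])
    threshold = sum(wi * (x + y) / 2.0 for wi, x, y in zip(w, mn, md))
    return w, threshold

def is_disease(model, row):
    w, threshold = model
    return sum(wi * xi for wi, xi in zip(w, row)) > threshold

def sensitivity_specificity(model, normal, disease):
    tp = sum(is_disease(model, r) for r in disease)
    tn = sum(not is_disease(model, r) for r in normal)
    return tp / len(disease), tn / len(normal)

# Placeholder features standing in for selected fingerprint bins.
rng = random.Random(0)
def fake_features(shift, n):
    return [[rng.gauss(shift, 1.0) for _ in range(3)] for _ in range(n)]

normal, disease = fake_features(0.0, 100), fake_features(4.0, 100)
model = train_lda(normal[:50], disease[:50])   # train on half of each class
print(sensitivity_specificity(model, normal[50:], disease[50:]))
```

Splitting each class in half for training and testing mirrors the equal class sizes in the study, which is why equal priors are a reasonable simplification in this sketch.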
Utilizing the training set only, an LDA classifier was trained. This model was then used to predict the remaining 100 instances in the test set in a blinded fashion. The class membership of these instances was then unblinded to evaluate the classification accuracy. This procedure was repeated for 100 random draws (a random draw is a random selection of 50 of the 100 instances of each class to act as the training set, the remaining 50 being the test set), each time building a fresh LDA classifier based upon only the training instances in the
Figure 6. Synthetic data. (a) Example of one of the “normal” instances. (b, c) Example of one of the “disease” instances, with 50 disease cluster events spiked into 100,000 events. In (b) the disease events are present but not marked, and in (c) they are marked in red. (d) Same instance as in (b, c), but all six projections are shown to indicate that the 2–3 projection is the one where the disease cluster is best separated from the background. (e, f) Another example of a disease instance. The regions encircled in (b, c) and (e, f) indicate the general location of the disease clusters. Note that the disease cluster is intermingled with the background. Comparing (a), (b), and (e) it is possible to discern the fact that no two instances are exactly the same, as described in the text.
draw and evaluating its accuracy using only the test instances for that draw. The average classification accuracy at 100/100,000 disease event concentration was 97.74%/98.5% (sensitivity/specificity). This accuracy degraded to 94.02%/93.98% and 67.30%/67.68% at disease event concentrations of 50/100,000 and 20/100,000, respectively. (At concentrations above 100/100,000 the classifier was perfectly accurate.) At 10/100,000 disease event concentration the LDA classifier performed only slightly better than the 50% null hypothesis, demonstrating that an effective lower limit of detection had been reached for this hypothetical “disease.” That this performance is achieved with a very simple classifier suggests that CF is an efficient method of representing differences among a set of multivariate probability density distributions.
How well does the simplest possible detection method work on these data? To answer this question we divided the 50/100,000 concentration training data into uniform bins at a resolution of 10 divisions on each of the four dimensions. We then computed Mann–Whitney P values for each bin location and found that three bins had P values less than $10^{-4}$. Counting events in these bins and averaging for the disease and normal groups separately, we detected on average only 12.97 (25.9%) of the 50 spiked events. CF detected on average 27.04 (54.1%) disease events in two high-resolution bins, indicating that CF creates a more efficient representation than uniform binning, in that fewer bins capture more information. PB by comparison detected on average 25.1 (50.2%) disease events in one lower resolution bin. Thus, for this comparison, CF and PB perform comparably, and both perform significantly better than uniform binning.
Quality Control Based on CF
CF has been used to quantitatively assess the consistency of gating across data derived from a panel of tubes.
Here we illustrate CF’s ability to complement rather than replace conventional methods of sequential gating data analysis. Often, clinical flow cytometric assays involve the acquisition of data from several tubes in order to accommodate many markers or stimulation conditions. Analysis of multitube data frequently involves the establishment of gates (based on a subset of parameters common to all tubes) which can be uniformly applied across the entire panel, thus providing a basis for consistent population identification and minimizing time and effort in the data analysis process. The underlying assumption is that events are distributed the same way for each tube across these gating parameters. Violations of this assumption due to unanticipated and undetected shifts or other dissimilarities of the gating data may result in inaccurate phenotyping. CF-based quality control assessment was applied in the context of a 96-well plate format flow cytometric experiment. In this study (19), fresh PBMCs were added to an array of antigenic stimuli and a set of controls in a 96-well plate format. Cells were allowed to respond to the stimuli for 6 h. Rows A–D of the plate contained IFN-γ FITC, CD69 PE, CD4 perCP-Cy5.5, and CD3 APC, whereas rows E–H contained TNF-α FITC, IL-2 PE, CD4 perCP-Cy5.5, and CD3 APC. Responses
and the cell types of the responders were then measured by gating on the common parameters including Side scatter, CD4 perCP-Cy5.5, and CD3 APC. To compare the gating distributions, cytometric fingerprints were computed on the three dimensions common to all 96 wells at a resolution of 32 bins by binning on the aggregated gating data from all 96 wells, projecting the distribution from each well onto the resulting template bins, and expressing the PDF for each well as a vector of the base-2 logarithm of the fractional deviation from the norm. Figure 7 shows examples of these fingerprints for each well, along with the maximum deviation for each well. Log2 (maximum deviation) values less than 2.0 are color-coded green, between 2.0 and 3.0 in yellow, and greater than 3.0 in red. Bins coded yellow and red clearly indicate disruptions of the characteristic CD3/CD4 distributions as shown in Figure 7c. This type of CF-based analysis provides a quick visual triage of possible data acquisition artifacts that can help the investigator recognize and correct defects.
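The per-well triage described above can be sketched as follows. The text does not give the exact definition of "fractional deviation from the norm", so this sketch assumes it is the ratio of a well's event fraction in a bin to the plate-wide (norm) fraction in that bin, and it color-codes each well by the largest absolute log2 ratio. The function names, the pseudocount used to avoid taking the logarithm of zero, and the example counts are all hypothetical.

```python
import math

def qc_fingerprint(well_counts, norm_fractions, pseudocount=0.5):
    """Assumed definition: log2 of (well bin fraction / norm bin fraction)."""
    n_bins = len(well_counts)
    total = sum(well_counts) + pseudocount * n_bins
    return [math.log2(((count + pseudocount) / total) / norm)
            for count, norm in zip(well_counts, norm_fractions)]

def triage(fingerprint):
    """Color-code a well by its maximum absolute log2 deviation."""
    worst = max(abs(v) for v in fingerprint)
    if worst < 2.0:
        return "green"
    if worst <= 3.0:
        return "yellow"
    return "red"

# With 32 equal-probability template bins, the norm is about 1/32 per bin.
norm = [1.0 / 32] * 32
typical_well = [100] * 32               # matches the plate norm
shifted_well = [100] * 31 + [600]       # one moderately over-populated bin
disrupted_well = [100] * 31 + [2000]    # one heavily over-populated bin
for counts in (typical_well, shifted_well, disrupted_well):
    print(triage(qc_fingerprint(counts, norm)))
```

Because the template bins are built from the aggregate of all wells, a well that follows the plate-wide distribution yields a fingerprint near zero in every bin, and only genuine departures push a bin past the color thresholds.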
DISCUSSION
In this report we describe CF, a new method for unbiased quantitative analysis of list-mode cytometric data. This technique relies on a method of subdividing a multidimensional space which extends that of PB. CF includes a novel algorithm for space subdivision that eliminates the constraint in PB of subdivision along the coordinate axes by dividing multidimensional space into nonrectangular polytopes. Additionally, it renders a multidimensional distribution in the form of a one-dimensional multiresolution hierarchical fingerprint that creates a computationally efficient representation of high-dimensionality distribution functions.
Each event in list-mode data is described by a vector of coordinates in a multidimensional space. Thus, a complete mathematical description of a sample is the multivariate probability distribution function defining the density of events in this space. This distribution may be approximated by dividing the space into small volume elements, counting the number of events in each volume element, and normalizing the count by the total number of events in the sample. In the limit of an infinite number of events, the regions may be made infinitesimally small yielding the true probability distribution function. Of course, it is impossible to collect an infinite number of events, and so the question of interest is how does one accurately estimate the true probability distribution from a finite sample of events? Equally importantly, how does one represent this approximation of the multivariate probability density function in a form amenable to comparing disparate samples?
The most common nonparametric means of estimating a PDF is a histogram where space is divided into equal width bins (20). For a complex (rapidly changing) PDF, one would like to choose small bins in order to accurately track the variation with respect to independent variables (low bias).
On the other hand, one would like to choose bins of sufficient size to contain a large number of events in order to estimate the value of the density within a bin with high accuracy (low variance). This trade-off between number of bins and bin size is the
Figure 7. Quality control assessment for gating data. (a) Fingerprints for 96-well plate data; (b) upper-left corner of the plate enlarged; (c) individual dotplots for CD3 vs. CD4 for three selected wells, superimposed on contours for the aggregated event density distribution (blue contours). White arrows in (c) for the dotplots corresponding to wells B-1 and C-12 indicate locations where event density distributions deviate from the plate norm.
classic bias-variance dilemma (21). For one independent variable and reasonably sized datasets, it is not difficult to balance the bias-variance requirements. For multidimensional data, however, this balance is harder to achieve. One approach is to control the variance by choosing bins that contain equal numbers of events. In the case of univariate data, there is a unique set of bin boundaries that accomplishes this goal (22). For multivariate data, however, there is not a unique solution. While this indeterminacy might seem like a disadvantage, in fact, it creates an opportunity to find a specific set of bin boundaries that does a superior job of reducing bias. Other methods of representing and analyzing multidimensional flow cytometry data have been developed (8,23–25). One that is most closely related to the present work is PB (7,10,12,22). PB represents a multidimensional probability distribution as a set of bins defining regions of the multidimensional space. The boundaries of these bins are chosen so that approximately equal numbers of events lie in each bin. Bins are found by selecting a coordinate dimension, determining the median in that coordinate, and dividing the data at the median value. In PB, the axis selection is made by calculating the variance of the data in the parent bin for each of the original coordinate dimensions and choosing the one dimension having the largest variance. Although the decision is made on the basis of the variance in each dimension, the split is not necessarily along the optimal direction because the direction of maximum variance may not coincide with one of the coordinate axes. The method of binning utilized in CF differs from that of PB in that bins are formed by splitting the data in the direction of maximum variance rather than along the original coordinate axes. A primary objective of CF is the development of a general and efficient representation of the multivariate
probability density function embodied in list-mode data that can be utilized with machine learning methods to analyze cytometric data. Key features of this representation are that it retains information at multiple resolutions and is in the form of a one-dimensional feature vector. As evidenced from the controlled spiking experiments, CF compares well with proven methods of data analysis (e.g., sequential gating) in cases where the target cell population is discrete and easily recognized. In a more challenging setting illustrated by the synthetic data example, CF is able to detect and identify events in complex, overlapping distributions that are not visible to the human eye in any 2D dotplot and are thus not amenable to methods of analysis that are graphically/visually guided. Furthermore, even at a level below 0.05% anomalous events in an ungated population, in which the anomalous events are not even detectable via conventional methods, cytometric fingerprints were able to efficiently represent the data such that a simple classifier (in this case LDA) was able to accurately and predictively classify instances into distinct “phenotypes” based upon the raw list-mode data alone. Importantly, CF does not depend upon arbitrary choices of gating or analytical thresholds, which collectively represent a priori hypotheses. Rather, CF provides a means to generate new hypotheses directly from the data, as well as to test existing hypotheses by direct comparison with traditional, proven methods of data analysis. It is important to note that as the number of parameters involved in flow cytometric studies increases, so will the complexity of the resulting distributions of events. Thus, there will be a concomitant demand for increasing the number of events collected. This is a fundamental requirement for multiparameter flow cytometry in general, and not just for the application of fingerprinting or other advanced methods of quantitative analysis. It is a result of the geometric dilution of the significance of any individual dimension into a “sea” of increasing dimensionality. CF seeks to maximize the utility of data characterized by a given number of events at a given dimensionality by optimizing the trade-off between bias and variance. Thus, although it might be argued that CF allows for the collection of the minimum number of events for a given dimensionality, we emphasize the need to collect an adequate sampling of events given the dimensionality, and then suggest the use of CF to maximize the utility of those data. CF demands adherence to strict, standardized operating procedures in the collection of flow cytometric data, and it can help in the development of such procedures. CF, like other quantitative methods of analysis of flow cytometric data, is sensitive to variations that exist in the data whether due to a true biological effect or artifact. In fact, as shown earlier and in Ref. 18, CF can be used to detect and correct artificial sources of variability. Improved procedures for collecting data will enable CF to be more sensitive to true biologic variations, thus deriving phenotype hypotheses that can then be tested. It is important that the quantitation of each parameter represent as accurately as possible its corresponding physical counterpart (e.g., the CD45-FITC fluorescence signal for an event should reproducibly represent the copy number of CD45 molecules on the surface of the corresponding cell). Methods for collection, storage, and transport of samples, variations in staining and other sample preparation variables, and instrument performance parameters can all contribute to small but important variations in the distribution of events in spaces of ever-increasing dimensionality.
In difficult applications, for example, the characterization of rare event distributions, small systematic errors may completely mask small but biologically significant effects. Therefore, standardized procedures, including appropriate controls, for each step in the chain of events leading from donor/patient to list-mode file are essential. While the need for such rigor may seem to be a limitation of CF, it is equally significant in any other methods of quantitative analysis of list-mode data. The CF method together with the adoption of such standardized protocols will result in a number of tangible and far-reaching benefits. First, the application of quantitative analysis methods such as CF reduces or eliminates tedious manual analysis procedures. This not only increases analysis throughput via automation but also removes an important source of analytical bias; namely, subjective decisions (such as choosing gates) made by a human analyst. Second, the incorporation of additional parameters into an assay, which potentially leads to a large increase in the number of steps involved in manual analysis, does not increase the apparent difficulty of analysis via CF, because data of any dimensionality are computationally reduced to a one-dimensional fingerprint. Therefore, studies with larger dimensionality are made more practical. Third and perhaps most significant, data that are collected according to standardized and controlled protocols can be integrated together with their fingerprints into a large-scale cytometry database. Such an archive will be a valuable resource for retrospective data mining for the creation and testing of new hypotheses for disease associations or biological mechanisms.
ACKNOWLEDGMENTS
Flow cytometry was performed at the Abramson Cancer Center Flow Cytometry and Cell Sorting Shared Resource of the University of Pennsylvania School of Medicine, which was established in part by equipment grants from the NIH Shared Instrument Program, and receives support from NIH P30 CA016520 from the National Cancer Institute. We thank the authors of Ref. 19 for generously making their 96-well plate data publicly available.
LITERATURE CITED
1. Wille-Reece U, Flynn BJ, Loré K, Koup RA, Miles AP, Saul A, Kedl RM, Mattapallil JJ, Weiss WR, Roederer M, Seder RA. Toll-like receptor agonists influence the magnitude and quality of memory T cell responses after prime-boost immunization in nonhuman primates. J Exp Med 2006;203:1249–1258.
2. Neeson P, Paterson Y. A new multi-parameter flow cytometric assay for monitoring lymphoma growth and spread in a pre-clinical murine model for human lymphoma. Cytometry Part A 2004;60A:8–20.
3. Precopio ML, Betts MR, Parrino J, Price DA, Gostick E, Ambrozak DR, Asher TE, Douek DC, Harari A, Pantaleo G, Bailer R, Graham BS, Roederer M, Koup RA. Immunization with vaccinia virus induces polyfunctional and phenotypically distinctive CD8(+) T cell responses. J Exp Med 2007;204:1405–1416.
4. Wu KH, Zhou B, Mo XM, Cui B, Yu CT, Lu SH, Han ZC, Liu YL. Therapeutic potential of human umbilical cord-derived stem cells in ischemic diseases. Transplant Proc 2007;39:1620–1622.
5. Zhao X, Huang L, Yin Y, Fang Y, Zhou Y. Autologous endothelial progenitor cells transplantation promoting endothelial recovery in mice. Transplant Int 2007;20:712–721.
6. Shaffer RG, Greene S, Arshi A, Supple G, Bantly A, Moore JS, Parmacek MS, Mohler ER III. Effect of acute exercise on endothelial progenitor cells in patients with peripheral arterial disease. Vasc Med 2006;11:219–226.
7. Roederer M, Moore W, Treister A, Hardy RR, Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry 2001;45:47–55.
8. Lugli E, Pinti M, Nasi M, Troiano L, Ferraresi R, Mussi C, Salvioli G, Patsekin V, Robinson JP, Durante C, Cocchi M, Cossarizza A. Subject classification obtained by cluster analysis and principal component analysis applied to flow cytometric data. Cytometry Part A 2007;71A:334–344.
9. Lizard G. Flow cytometry analyses and bioinformatics: interest in new softwares to optimize novel technologies and to favor the emergence of innovative concepts in cell research. Cytometry Part A 2007;71A:646–647.
10. De Rosa SC, Brenchley JM, Roederer M. Beyond six colors: a new era in flow cytometry. Nat Med 2003;9:112–117.
11. Jackson J. A User’s Guide to Principal Components. New York: Wiley-Interscience; 1991.
12. Roederer M, Hardy RR. Frequency difference gating: a multivariate method for identifying subsets that differ between samples. Cytometry 2001;45:56–64.
13. O’Connell M. Search program for significant variables. Comput Phys Commun 1974;8:49–55.
14. Pearson K. On lines and planes of closest fit to systems of points in space. Philos Mag 1901;2:559–572.
15. Jolliffe I. Principal Component Analysis. New York: Springer; 2002.
16. Hotelling H. Analysis of a complex statistical variable into principal components. J Educ Psych 1933;26:417–441, 498–520.
17. Arnold G, Collins A. Interpretation of transformed axes in multivariate analysis. Appl Statist 1993;42:381–400.
18. Rogers W, Holyst H, Moore J, Schretzenmair R, Moser A. Qtube: automated quality assessment of flow cytometric data based on cytometric fingerprinting. Cytometry Part B 2007;72B:490.
19. Inokuma M, dela Rosa C, Schmitt C, Haaland P, Siebert J, Petry D, Tang M, Suni MA, Ghanekar SA, Gladding D, Dunne JF, Maino VC, Disis ML, Maecker HT. Functional T cell responses to tumor antigens in breast cancer patients have a distinct phenotype and cytokine signature. J Immunol 2007;179:2627–2633.
20. Sturges HA. The choice of a class interval. J Am Statist Assoc 1926;21:65–66.
21. Liu RC, Brown LD. Nonexistence of informative unbiased estimators in singular problems. Ann Statist 1993;21:1–13.
22. Roederer M, Treister A, Moore W, Herzenberg LA. Probability binning comparison: a metric for quantitating univariate distribution differences. Cytometry 2001;45:37–46.
23. Robinson JP, Ragheb K, Lawler G, Kelley S, Durack G. Rapid multivariate analysis and display of cross-reacting antibodies on human leukocytes. Cytometry 1992;13:75–82.
24. Robinson JP, Durack G, Kelley S. An innovation in flow cytometry data collection and analysis producing a correlated multiple sample analysis in a single file. Cytometry 1991;12:82–90.
25. Murphy RF. Automated identification of subpopulations in flow cytometric list mode data using cluster analysis. Cytometry 1985;6:302–309.