IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 55, NO. 3, MARCH 2008
A Multidimensional Classification Approach for the Automated Analysis of Flow Cytometry Data
Carlos Eduardo Pedreira*, Senior Member, IEEE, Elaine S. Costa, M. Elena Arroyo, Julia Almeida, and Alberto Orfao
Abstract—We describe an automated multidimensional approach for the analysis of flow cytometry data based on pattern classification. Flow cytometry is a widely used technique for both research and clinical purposes, where it has become essential for the diagnosis and follow-up of a wide spectrum of diseases, such as HIV infection and neoplastic disorders. Flow cytometry data sets are composed of a large number of observations that can be viewed as elements of a multidimensional space. The aim of the analysis of such data files is typically to classify groups of cellular events as specific populations with biological meaning. Despite significant improvements in the data acquisition capabilities of flow cytometers, data analysis is still based on bi-dimensional strategies that were defined a long time ago. These strategies depend strongly on the expertise of the operator, which makes the approach relatively subjective and potentially leads to unreliable results. Automated analysis of flow cytometry data is an essential step to improve the reproducibility of the results. The proposed automated analysis was applied to peripheral blood lymphocyte subsets from 307 samples stained and prepared in an identical way, and it identified all cell subsets present in each sample that could also be detected in the same data files by an expert operator. A highly significant correlation was found between the results obtained by an expert operator using a conventional manual method of analysis and those obtained using the implemented automated approach.
Index Terms—Automation, B-cell chronic lymphoproliferative disorders, cancer, flow cytometry, leukemia, lymphocytosis, pattern classification, vector quantization.
Manuscript received February 3, 2007; revised July 8, 2007. This work was supported in part by grants from the Fondos de Investigación Sanitaria (Ref PI060824), the Spanish Network of Cancer Research Centers (Ref RD06/0020/ 0035), (Instituto de Salud Carlos III/Fondos FEDER, Ministerio de Sanidad y Consumo) and Programa Hispano-Brasileño de Cooperación Universitaria Ref. PHB 2004–0800-PC (Ministerio de Educación y Ciencia), Madrid, Spain, and CAPES/Ministerio da Educação, Brasília, Brazil. The work of E. S. Costa was supported by a grant from FAPERJ, Rio de Janeiro Research Foundation. The work of C. E. Pedreira was supported in part by grants from CNPq, Brazilian National Research Council, Brasília, Brazil, and FAPERJ, Rio de Janeiro Research Foundation. Asterisk indicates corresponding author. *C. E. Pedreira is with the School of Medicine and COPPE-PEE—Engineering Graduate Program, Federal University of Rio de Janeiro (UFRJ), Av. Brigadeiro Trompowski, s/n, Universitária Ilha Do Fundao, Rio de Janeiro 21941972, Brazil (e-mail:
[email protected]). E. S. Costa is with the Clinical Medicine Graduate Program and IPPMG Hospital, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro CEP 22261–050, Brazil. M. E. Arroyo, J. Almeida, and A. Orfao are with the Cytometry Service, Department of Medicine and Cancer Research Center, University of Salamanca, Salamanca 37007, Spain. Digital Object Identifier 10.1109/TBME.2008.915729
I. INTRODUCTION
Flow cytometry is a well-established, widely used technique for both research and clinical purposes, where it has become essential for the diagnosis and follow-up of a wide spectrum of diseases, mainly HIV infection and clonal haematological disorders such as acute and chronic leukemias and non-Hodgkin’s lymphomas [1]. A major advantage of flow cytometry is the fast evaluation of multiple parameters in millions of cells, with digitized information being stored for each cell measured. Such multiparameter flow cytometry analyses typically allow an accurate identification and characterization of neoplastic cells in a sample, providing essential information for diagnosis, classification, and decision making in individual patients; at the same time, they allow the identification of populations of neoplastic cells present at very low frequencies in a sample, among a major population of normal cells [2]. To reach these goals, large data sets are generated quickly (in a few seconds): information about six or more cell-associated parameters is typically generated for several tens or hundreds of thousands of cells. The generated information is stored for each event (cell) in a standardized FCS list mode format; overall, the number of individual data points (entries) is about 25 times larger than that of a typical data set containing information about a sample analyzed by DNA microarray techniques [3]. From the engineering point of view, flow cytometry data sets are composed of a large number of observations (tens of thousands to millions) that can be modeled as elements of a multidimensional real space. Analysis of such data files typically aims to classify all groups of cellular events into specific populations with biological meaning.
Despite significant improvements in the data acquisition capabilities of flow cytometers, data analysis is still based on strategies that were defined more than 20 years ago [4]. Accordingly, analysis of flow cytometry data is typically based on the definition of a variable number of bi-dimensional plots, in which an experienced operator selects the subpopulations of interest [1]. Depending on the expertise of the operator, specific cell populations, particularly those present at low frequencies, can often be misidentified. In this regard, it should be noted that such an approach is relatively subjective, since the operator is assumed to know where to look for the cells of interest in the multidimensional space. It is important to be aware that overestimation and/or underestimation of specific cell populations in a sample may have a negative impact on diagnosis. Thus, automated analysis of flow cytometry data is an essential step to
improve the reproducibility of the results; at the same time, it will speed up the screening of large numbers of samples. In the present paper, we describe an automated multidimensional approach for the analysis of cytometry data based on vector quantization (VQ) and previously well-established biological knowledge. Here, VQ is used in the sense of building up a quantized approximation to the data distribution using a finite number of prototype vectors [11]. The prototypes were established manually, using biological knowledge about the in-sample observations. Here, we describe the problem from an engineering perspective, while its medical application has been recently reported [5].
II. METHODOLOGY AND DATA
A. Flow Cytometry
Cytometers are instruments capable of analyzing liquid samples containing cells, such as peripheral blood (PB). These fluids are subjected to a laminar flow that forces the cells to pass one by one through a very narrow capillary, where they are interrogated by the light of one or more lasers. Changes in the direction of the laser light due to light refraction (forward light scatter, or FSC) and to light reflection at 90° (sideward scatter, or SSC), as well as changes in its wavelength due to the presence of fluorochromes in the cell, can be inferred for each cell in the sample (Fig. 1). The amounts of FSC and SSC directly reflect the size and internal complexity of the cells, respectively. In turn, the fluorescence emissions measured are usually due to staining of the cell with specific fluorochrome-conjugated monoclonal antibodies, which recognize proteins expressed by specific populations of cells and may be excited by the laser light to emit fluorescence. Currently available clinical flow cytometers may have up to nine different fluorescence detectors, which allow the simultaneous evaluation of the expression of several proteins for each cell analyzed [4].
In the last decade, multiparameter flow cytometry immunophenotyping has become the method of choice for the differential diagnosis between reactive (e.g., due to an infection) and neoplastic (e.g., due to a malignant tumor) absolute lymphocytosis [7]. For this purpose, three- and four-color single-tube combinations of up to seven monoclonal antibodies have been proposed, which allow the identification and enumeration of up to 13 different populations of lymphocytes that may be present in a PB sample [8]. Typically, in these analyses, information about six or more attributes of several tens of thousands of cells is measured and digitally stored in list mode data files, in an FCS format.
Fig. 1. Schematic representation of how the light source (laser) illuminates the flow chamber of a flow cytometer and how light-based flow cytometry-associated parameters are generated after the interaction of the laser light with the single-cell sample flow.
B. On the Dataset
The data set used in the present paper included flow cytometry data obtained through the measurement of six different parameters for 60 000 events (PB cells), for a total of 307 samples corresponding to an identical number of adult individuals with either normal/reactive or increased lymphocyte counts in a routine blood cell analysis. Of these, 230 cases corresponded to normal or reactive samples (abnormally increased number of nonneoplastic lymphocytes), as defined by the absence of any expanded population of clonal lymphocytes, and 77 to neoplastic disorders of mature B-lymphocytes (B-cell chronic lymphoproliferative disorders; B-CLPD) diagnosed according to the World Health Organization criteria for hematological neoplasias [6]. Sixty-eight cases had B-cell chronic lymphocytic leukemia (B-CLL); five, mantle cell lymphoma (MCL); one, marginal zone splenic lymphoma; one, follicular lymphoma; one, MALT lymphoma; and one, a nonclassifiable B-CLPD. The flow cytometry data sets used in the present study correspond to samples stained with the following four-color combination of monoclonal antibodies (fluorescein isothiocyanate (FITC)/phycoerythrin (PE)/peridinin chlorophyll protein-cyanin 5.5 (PerCP-Cy5.5)/allophycocyanin (APC)): CD8 plus surface immunoglobulin (sIg) $\lambda$/CD56 plus sIg$\kappa$/CD4 plus CD19/CD3 [5]. All individuals gave their informed consent prior to entering the study, and the study was approved by the local Ethical Committee of the University Hospital of Salamanca (Salamanca, Spain).
C. Multidimensional Classification Approach for the Analysis of Flow Cytometric Data Sets
Concerning notation, throughout the paper we denote scalars and vectors with small letters and matrices with capital letters. Let $e_i \in \mathbb{R}^6$ represent the six measured attributes of the $i$th cell event (observation).
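As a rough illustration of the data layout described above, the following sketch represents one event as a record of six attributes and counts the stored values. The field names and the assumption that each of the 307 samples contains 60 000 events are ours, chosen for illustration; they are not taken from the FCS files themselves.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One cellular event: two light scatter and four fluorescence values.

    Field names are hypothetical labels for the six measured attributes.
    """
    fsc: float          # forward light scatter
    ssc: float          # sideward light scatter
    fitc: float         # green fluorescence
    pe: float           # orange fluorescence
    percp_cy55: float   # red fluorescence
    apc: float          # deep red fluorescence


N_PARAMETERS = len(Event._fields)
EVENTS_PER_SAMPLE = 60_000   # assumption: events acquired per sample
N_SAMPLES = 307

# Number of stored values per sample and over the whole series.
values_per_sample = N_PARAMETERS * EVENTS_PER_SAMPLE
values_total = values_per_sample * N_SAMPLES

print(values_per_sample, values_total)
```

Under these assumptions, each sample is simply a list of 60 000 such six-component vectors, which is the form used by the classification rules discussed in this section.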
Two of these attributes are light scatter measurements (FSC and SSC), and four are measurements of emissions at different wavelengths associated with staining for specific proteins with monoclonal antibodies conjugated with four different fluorochromes: fluorescein isothiocyanate-associated green fluorescence, corresponding to the CD8 and sIg$\lambda$ proteins; phycoerythrin-associated orange fluorescence, for the sIg$\kappa$ and CD56 proteins; peridinin chlorophyll protein-cyanin 5.5-associated red fluorescence, corresponding to the CD4 and CD19 proteins; and allophycocyanin-associated deep red fluorescence, corresponding to the CD3 protein. A first
goal is to assign each attribute vector $e_i$, corresponding to the $i$th cellular event, to one of $K$ “groups-of-events.” Here, we define a “group-of-events” as a set of vectors that have similar characteristics; such a “group-of-events” does not necessarily have a biological meaning. In turn, a “population” is defined as a single “group-of-events” or an assemblage of “groups-of-events” that has a biological significance. Since these populations may be hard to model directly, we apply the “divide-and-conquer” rule by considering populations as collections of simpler “groups-of-events.” This is particularly useful because, in what follows, we assign prototypes to these “groups-of-events,” which is equivalent to assigning multiple prototypes to the more complex population. For instance, the granulocytes form a population that was modeled as an assemblage of two groups-of-events without biological meaning. A possible approach is to place a set of $K$ prototypes $p_1, \ldots, p_K$ and associate each of these vectors with one of the $K$ groups-of-events. Each cellular event is then assigned to the group represented by its nearest prototype in the Euclidean metric [9]–[11]. Therefore, an event $e_i$ belongs to a “group-of-events” $j$ if, and only if
$$\|e_i - p_j\| \le \|e_i - p_k\| \quad \text{for } k = 1, \ldots, K.$$
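The nearest-prototype rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB implementation; the prototype and event coordinates are hypothetical values in a two-dimensional FSC/SSC projection.

```python
import math


def nearest_prototype(event, prototypes):
    """Return the index of the prototype closest to `event` (Euclidean metric)."""
    distances = [math.dist(event, p) for p in prototypes]
    return min(range(len(prototypes)), key=distances.__getitem__)


# Hypothetical prototypes (FSC, SSC), one per group-of-events.
prototypes = [(200.0, 100.0), (600.0, 700.0)]

# Hypothetical events to be assigned.
events = [(220.0, 130.0), (580.0, 650.0), (390.0, 400.0)]

labels = [nearest_prototype(e, prototypes) for e in events]
print(labels)
```

Because only the distance to each prototype matters, this rule ignores how widely each group is spread in different directions, which is the limitation addressed next in the text.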
However, with this rule each “group-of-events” is implicitly assumed to be spherical; as a consequence, differences in dispersion along distinct directions are not taken into consideration. To overcome this drawback, we abandoned the Euclidean metric in favor of the Mahalanobis distance by associating a multivariate Gaussian function with each group-of-events. Differences in dispersion along different directions are now taken into consideration by appropriately setting the dispersion matrices. Accordingly, a cellular event is not necessarily allocated to the “group-of-events” represented by its nearest (in the Euclidean metric) prototype. An illustrative example is shown in Fig. 2. To keep this example simple, we consider a projection onto the FSC/SSC plane and assume that the granulocyte and lymphocyte populations are each formed by just one “group-of-events,” although in practice they are formed by more than one. Although one of the marked events in Fig. 2 is closer to the lymphocyte population according to the central tendency measure, it is assigned to the granulocyte population because of the shape of the dispersion of the latter cell population. By the Mahalanobis metric, two of the marked events in Fig. 2 are at the same distance from the mean of the granulocyte population. By the Euclidean metric, two of the marked events would have an identical probability of belonging to the granulocyte population (if we considered only their distance to the mean of this population), which is clearly nonsensical.
Formally, let $f(\cdot, \mu, \Sigma): \mathbb{R}^n \to \mathbb{R}$ be a Gaussian function parameterized by $\mu$, the mean, and $\Sigma$, the covariance matrix, where $n$ is the number of attributes considered. Then, for each event $e_i$,
$$f(e_i, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(e_i - \mu)^{T} \Sigma^{-1} (e_i - \mu)\right).$$
Fig. 2. Representative example of the distribution of the granulocyte and lymphocyte populations from a normal peripheral blood sample in an FSC versus SSC plot. Three of the marked events $e$ belong to the granulocyte population, while the fourth does not.
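The difference between the two metrics discussed in connection with Fig. 2 can be illustrated numerically. The sketch below is restricted to two dimensions (an FSC/SSC projection) so that the covariance matrix can be inverted by hand; all means, covariances, and event coordinates are hypothetical values chosen for illustration and are not taken from the paper's data.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from `mean` under covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the 2x2 inverse.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical groups: a compact lymphocyte-like group and a
# granulocyte-like group that is strongly elongated along SSC.
lym_mean, lym_cov = (300.0, 150.0), ((900.0, 0.0), (0.0, 900.0))
gran_mean, gran_cov = (500.0, 600.0), ((2500.0, 0.0), (0.0, 40000.0))

event = (480.0, 330.0)

# Euclidean distance favors the lymphocyte-like group ...
euclid_lym = math.dist(event, lym_mean)
euclid_gran = math.dist(event, gran_mean)

# ... while the Mahalanobis distance favors the granulocyte-like group.
maha_lym = mahalanobis_2d(event, lym_mean, lym_cov)
maha_gran = mahalanobis_2d(event, gran_mean, gran_cov)

# Two events at very different Euclidean distances from the granulocyte mean
# can lie at the same Mahalanobis distance from it.
a, b = (550.0, 600.0), (500.0, 800.0)
maha_a = mahalanobis_2d(a, gran_mean, gran_cov)
maha_b = mahalanobis_2d(b, gran_mean, gran_cov)

print(euclid_lym, euclid_gran, maha_lym, maha_gran, maha_a, maha_b)
```

In this toy setting the event is closer to the lymphocyte-like mean in Euclidean terms, yet the elongated dispersion of the granulocyte-like group makes it the more plausible owner of the event.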
In fact, we have $K$ of these functions, $f(\cdot, \mu_k, \Sigma_k)$, $k = 1, \ldots, K$, corresponding to the $K$ “groups-of-events” and parametrized by the mean vectors $\mu_k$ and the covariance matrices $\Sigma_k$. These parameters were all set by experts, based on well-established biological knowledge and on sample inspection using the in-sample observations only. After this learning phase, the parameters are no longer changed, meaning that the operator does not need any expertise in setting them. Accordingly, in the out-of-sample validation phase, we kept all the in-sample settings unchanged. For the learning (in-sample) phase, we used 20 randomly selected data sets corresponding to ten normal, five reactive, and five B-CLL PB samples. In the validation (out-of-sample) phase, 307 data sets corresponding to 198 control PB samples and 109 PB samples with absolute lymphocytosis were used. All reported results refer to the validation phase. In the out-of-sample phase, each cellular event is assigned to the “group-of-events” for which it reaches the highest value among all Gaussian functions; i.e., a cellular event $e_i$ belongs to a “group-of-events” $j$ if and only if
$$f(e_i, \mu_j, \Sigma_j) \ge f(e_i, \mu_k, \Sigma_k) \quad \text{for } k = 1, \ldots, K.$$
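A minimal sketch of this assignment rule, combined with the cutoff-based debris removal described in this section, is given below. It is restricted to two dimensions for simplicity and is not the authors' implementation: the group parameters are hypothetical, and the way the cutoffs are chosen here (the value of each Gaussian at a Mahalanobis distance of 4 from its mean) is purely an assumption for illustration, since in the paper the cutoffs were set by experts.

```python
import math


def gaussian_2d(x, mean, cov):
    """Value at `x` of a 2-D Gaussian function with the given mean and covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


def cutoff_at(cov, mahalanobis_distance):
    """Hypothetical cutoff: Gaussian value at a given Mahalanobis distance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return math.exp(-0.5 * mahalanobis_distance ** 2) / (2.0 * math.pi * math.sqrt(det))


def classify(event, groups, cutoffs):
    """Return the index of the group with the highest Gaussian value,
    or None when the event falls below the cutoff of every group (noise)."""
    values = [gaussian_2d(event, mean, cov) for mean, cov in groups]
    if all(v < c for v, c in zip(values, cutoffs)):
        return None
    return max(range(len(groups)), key=values.__getitem__)


# Hypothetical groups-of-events as (mean, covariance) in the FSC/SSC plane.
groups = [
    ((300.0, 150.0), ((900.0, 0.0), (0.0, 900.0))),      # lymphocyte-like
    ((500.0, 600.0), ((2500.0, 0.0), (0.0, 40000.0))),   # granulocyte-like
]
cutoffs = [cutoff_at(cov, 4.0) for _, cov in groups]

events = [(480.0, 330.0), (310.0, 160.0), (900.0, 50.0)]
labels = [classify(e, groups, cutoffs) for e in events]
print(labels)
```

The last event lies far from both groups and is therefore labeled as noise, while the other two are assigned to the group whose Gaussian function takes the highest value.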
Finally, we implemented a scheme to remove debris by imposing a cutoff value for each Gaussian function. Events falling below this value were considered to be noise and discarded, since the chance that they belong to any of the groups is very low. Formally, if the value of the $k$th Gaussian function at event $e_i$ satisfies $f(e_i, \mu_k, \Sigma_k) < \text{cutoff}_k$ for all $k = 1, \ldots, K$, then event $e_i$ is considered to be noise and discarded. When analyzing data for different applications of clinical cytometry, one has to be extremely careful with the concept of “outliers,” since in some situations the population of interest represents only a minority of all cellular events. On the other hand, almost all problems addressed by clinical cytometry are very well known from the biological point of view, so that
Fig. 3. Representation of the automated algorithm used for the identification of the different subpopulations of PB lymphocytes. In each step, the groups of events/populations listed are defined together with the aim of the statistical analysis applied.
experts have secure knowledge of those regions of the marker feature space where it would not be possible to have events with biological meaning. We therefore set the cutoffs to eliminate events that are securely far away from the region of interest.
D. Automatic Search for Malignant Cell Populations
We focused on the identification of subpopulations of lymphocytes, the most frequent use of flow cytometry in clinical laboratories. The specific panel of monoclonal antibodies used in this paper is currently considered one of the best strategies for the diagnostic screening of lymphocytosis. It is important to note that in all clinical cytometry problems, the ability to discriminate between different cell populations is directly linked to the choice and use of an appropriate panel of monoclonal antibodies. The chosen panel should make evident the existing differences in the patterns of protein expression among the cell populations involved. For instance, in a normal bone marrow sample, three B-cell subpopulations may be identified if one uses an anti-CD20 antibody conjugated with phycoerythrin for the identification of this protein. In contrast, if the same sample is stained with the same antibody clone conjugated with fluorescein isothiocyanate, only two B-cell subpopulations will become apparent. An increase in the number of cells in one or more of the different subpopulations of lymphocytes may be due to either an infectious/inflammatory disease or a tumoral process. In fact, the search for abnormal populations of lymphocytes has two goals: 1) in patients with an increased number of lymphocytes (termed lymphocytosis), to determine which subpopulation(s) of lymphocytes is (are) responsible for this alteration; and 2) to search for a neoplastic subpopulation of lymphocytes, even when the number of lymphocytes is not increased. Any of the
subpopulations of lymphocytes present in a sample may be responsible for the lymphocytosis, and any of them could have a tumoral nature. This is why it is important to simultaneously identify and quantify all subpopulations of lymphocytes in a sample. Taking this concept into account, our strategy follows sequential steps based on biological knowledge related to the patterns of protein expression and the light scatter characteristics of the cell populations in a sample. The main reason for establishing the sequence of steps described below was not to reduce the dimensionality of the problem, but to build a path grounded on well-established biological concepts. We started by separating lymphocytes from the remaining PB populations, namely the monocytes and granulocytes. It should be noted that the lymphocyte population typically forms a well-defined cluster of events in the two light scatter parameters (forward and sideward light scatter, FSC and SSC) (Figs. 3 and 4, step A). So, in accordance with the previously described methodology, we placed Gaussian functions over the two-dimensional FSC/SSC space to separate the events belonging to the population of lymphocytes from the monocytes and granulocytes. Once classified and quantified, the events corresponding to the populations of monocytes and granulocytes were discarded, and we proceeded to the next step with only the population of interest, the lymphocytes. Afterward, in step B in Fig. 3 (corresponding to Fig. 4, panel B), the pattern of expression of the CD3 protein and the light scatter measurements were used to define the population of T-lymphocytes. In step C in Fig. 3 (Fig. 4, panel C), the three different types of T-lymphocytes were identified: T-lymphocytes expressing the CD4 protein (blue), T-lymphocytes displaying the CD8 protein (violet), and T-lymphocytes with neither CD4 nor CD8 expression (yellow). These three subgroups were separated using five feature-parameters: expression of the CD4, CD8,
Fig. 4. Illustrative example of the implemented automated algorithm described in Fig. 3, used for the identification of the different subpopulations of lymphocytes present in a peripheral blood sample containing leukemic sIg$\kappa^+$ B-cells. In panel A, the colors yellow, violet, and blue correspond to the populations of granulocytes, monocytes, and lymphocytes, respectively. Only the lymphocytes are represented in panel B, where they appear subdivided into two subpopulations: T lymphocytes (blue dots) and non-T lymphocytes (violet dots). In panel C, T lymphocytes appear divided into three populations: CD4$^+$/CD8$^-$ T cells (blue dots), CD4$^-$/CD8$^+$ T cells (violet dots), and CD4$^-$/CD8$^-$ T cells (yellow dots). The non-T lymphocytes are represented in panel D, where they appear subdivided into B-lymphocytes (blue dots) and non-B, non-T lymphocytes (violet dots). In panel E, the three subpopulations of B lymphocytes present in this sample are represented: the two normal subpopulations, each expressing a single immunoglobulin (Ig) light chain (violet dots and blue dots), and abnormal B lymphocytes (yellow dots). The separation of NK lymphocytes (blue dots) from the residual noise events (violet dots) is illustrated in panel F. The subtypes of NK lymphocytes, distinguished by their CD8 expression, are plotted in panel G (blue dots and violet dots).
and CD3 proteins together with both the FSC and SSC light scatter characteristics of the cells. Note that, although we provide a bi-dimensional plot as an illustration in Fig. 4, this step in fact comprised a classification in a five-dimensional space. At this point, the method has already identified all T-lymphocytes, so that any imbalance in these groups can be verified. In the example presented in Fig. 4 (see panel C), all populations of T-lymphocytes are normal. Following the sequence, the next step (step D of Fig. 3) focused on non-T lymphocytes. In step D, the B-cell population was first discriminated from the remaining events using the pattern of expression of the CD19 protein and both the FSC and SSC light scatter characteristics of the cells (see Fig. 4, panel D). Once again, we point out that Fig. 4 (see panel D) is just an illustration, since the actual separation was done in a three-dimensional space. This is a crucial step, since B-cells represent the population of lymphocytes most frequently affected by cancer. Normal B-cells express only one of two kinds of immunoglobulin light chains, either kappa or lambda. Because of this, the population of normal B-lymphocytes is typically distributed in two subpopulations according to the immunoglobulin light chain expressed (kappa or lambda). Samples containing pathologic B-cells are frequently characterized by an imbalanced ratio between the number of events belonging to each of these two subpopulations or by the presence of a subpopulation of B-cells showing a low intensity of expression of kappa or lambda
immunoglobulin light chains in the cell sample. In panel E of Fig. 4 (corresponding to Fig. 3, step E), a pathological B-cell subpopulation (yellow dots) was separated from the normal B-cell subpopulations (blue dots and violet dots in Fig. 4, panel E). In this example, one can observe that the pathologic B-cells express low levels of kappa immunoglobulin and also that the kappa/lambda B-cell ratio is extremely high. In this step, separation between the different B-cell populations was performed in a five-dimensional space, using the patterns of expression of kappa and lambda immunoglobulin light chains, CD19, and the two (FSC and SSC) light scatter parameters. At this point, one already knows whether or not a neoplastic B-cell population is present in the sample. The remaining populations of lymphocytes are identified in steps F and G of Fig. 3 (corresponding to Fig. 4, panels F and G). In step F, the remaining events (non-B and non-T lymphocytes) are distributed between NK-cells and debris (or noise events) according to the expression of the CD56 protein and the two light scatter parameters, so that they can be classified in a three-dimensional space. Finally, NK-cells were subdivided into CD8-positive and CD8-negative NK-subpopulations, according to their expression of this protein.
E. Generation and Analysis of Data-Files Containing Artificially Generated Mixtures of Different Cell Populations
To determine the sensitivity limit of the automated method for the detection of pathologic lymphoid cells in a PB sample, two kinds of experiments were done. The first group of experiments consisted of progressive dilutions of aliquots of pathologic samples (50%, 20%, 10%, 5%, 3%, 1%, 0.5%, and 0.1%) in normal PB samples; this experiment was repeated for five patients’ samples. In the second set of experiments, progressive computational dilutions of data corresponding to pathologic cells in data from normal PB cells were performed. The following proportions were used: 50%, 20%, 10%, 5%, 3%, 1%, 0.5%, and 0.1%.
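The computational dilution described above can be sketched as a simple resampling of two event lists. This is a minimal illustration under our own assumptions: the function name `dilute`, the fixed total of 10 000 events per mixture, the random sampling without replacement, and the placeholder event records are all hypothetical and not taken from the paper.

```python
import random


def dilute(pathologic_events, normal_events, fraction, total, seed=0):
    """Build a mixture of `total` events in which `fraction` are pathologic."""
    rng = random.Random(seed)
    n_pathologic = round(fraction * total)
    n_normal = total - n_pathologic
    mixture = (rng.sample(pathologic_events, n_pathologic)
               + rng.sample(normal_events, n_normal))
    rng.shuffle(mixture)  # remove any ordering by origin
    return mixture


# Placeholder events: (origin label, event index) instead of real attribute vectors.
pathologic = [("neoplastic", i) for i in range(5_000)]
normal = [("normal", i) for i in range(20_000)]

proportions = [0.50, 0.20, 0.10, 0.05, 0.03, 0.01, 0.005, 0.001]
counts = {}
for p in proportions:
    mixture = dilute(pathologic, normal, p, total=10_000)
    counts[p] = sum(1 for origin, _ in mixture if origin == "neoplastic")

print(counts)
```

Each mixture can then be passed through the same classification steps as a real data file, and the smallest proportion at which the pathologic population is still flagged gives an estimate of the detection limit.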
For example, in a 10% dilution, 1000 neoplastic events were added to 9000 normal events. Once again, this experiment was repeated with five different patient samples. To evaluate the robustness of the automated method, we simulated the presence of noise events. Accordingly, variable proportions of “noise” events (between 1% and 50%) were randomly generated and added to data files containing lymphocyte events. Each of these files was built to contain a known quantity of events of each of the different subpopulations of lymphocytes. All calculations were performed using the MATLAB software program (Mathworks, Natick, MA).
III. RESULTS
A. Correlation Between Manual and Automated Methods for the Analysis of the Distribution of Different Subpopulations of PB Lymphocytes
The proposed automated method of analysis of PB lymphocyte subsets was able to identify all cell subsets present in the 307 samples studied that could also be detected in the same data files by an expert operator. As shown in Fig. 5, a
highly significant correlation was found between the results obtained by an expert operator using a conventional manual method and those obtained using the automated method for the analysis of the distribution of the major subpopulations of PB lymphocytes in the whole series of PB samples analyzed. In addition, in all samples containing pathological B-cells, these were detected by the automated method, while no pathological B-cell populations were detected in normal and reactive PB samples. Accordingly, all cases having a neoplastic B-cell chronic lymphoproliferative disorder (B-CLPD) could be identified and clearly discriminated from the normal/reactive samples based on an increased percentage of total B-cells and/or an altered kappa/lambda B-cell ratio; the most discriminating cutoff values for these variables were 23.8% for the whole series of B-CLPD cases, and 4.7 and 0.3 in cases showing neoplastic kappa-positive and lambda-positive B-cells, respectively. At these cutoff values, both 100% sensitivity and 100% specificity were achieved for the two variables.
Fig. 5. Correlation plots between the percentage of cells corresponding to each subpopulation of lymphocytes as calculated by the proposed automated method and by the manual analytic approach performed by an expert operator (n = 307 samples).
TABLE I SUMMARY OF THE RESULTS ACHIEVED WITH THE TWO TYPES OF DILUTIONAL EXPERIMENTS PERFORMED AND ACHIEVED DETECTION LIMITS
B. Evaluation of the Performance of the Automated Method in Analyzing Artificial Data Files Containing Low Numbers of Pathological Cells and High Numbers of “Noise” Events
When applied to artificial data files containing decreasing numbers of pathological B-cells, the automated method was able to clearly classify a file as containing B-cells suspected of being pathological, according to previously defined numerical criteria (cutoff values for maximum sensitivity and specificity established in Costa et al. [5]). This held in all dilutional experiments performed, even when the pathological B-cells represented only a small percentage (down to 5%) of all lymphocytes in the sample, provided they were phenotypically different from normal residual B-cells. A similar sensitivity limit of detection was observed with the two different types of dilutional experiments performed: down to 5% for leukemia cells diluted in normal PB samples and down to 3% for leukemia events diluted in normal PB FCS data files (see Table I). Regarding the analysis performed by the automated method on those files corresponding to a sample randomly combined with variable numbers of “noise” events, these were found to not
TABLE II IMPACT OF INTRODUCING NOISE EVENTS ON THE RESULTS OF THE AUTOMATED METHOD OF ANALYSIS OF FLOW CYTOMETRY DATA
significantly interfere with the analysis of the different subpopulations of PB cells present in the artificial data file, with respect to the results of the analyses performed for the original file in the absence of artificial “noise” (see Table II). Furthermore, no false pathological events were detected due to the effect of the generated noise.
IV. CONCLUSION
In this paper, we describe an automated approach for the analysis of flow cytometry data obtained through the staining of PB samples with a four-color, seven-marker combination of monoclonal antibody reagents. Multiparameter flow cytometry immunophenotyping has become the method of choice for the diagnostic screening of neoplastic versus reactive (infectious/inflammatory) lymphocytosis, leading to a progressively higher rate of early diagnosis of the most common group of hematological cancers (B-CLPD), even prior to the onset of clinical manifestations [12]. However, this laboratory measurement is considered to be relatively complex and requires highly expert and trained personnel for both the analysis of immunophenotypic data and the interpretation of the results. A detailed comparison of the automated method of analysis with a conventional manual operator-dependent approach showed a high correlation for the different measured subpopulations of lymphocytes. Furthermore, the automated method was able to identify abnormal subpopulations of B-cells in all pathological samples, while it did not identify any pathologic population in any normal/reactive sample. Once implemented in an appropriate software platform, the automated method can be faster and easier to perform than manual operator-dependent analyses, leading to an objective and efficient discrimination between normal/reactive and neoplastic lymphocytosis without an absolute requirement for highly expert and trained personnel for data analysis.
In a day-to-day operational mode, the analysis resulting from the implemented methodology is automatic in the sense that it neither involves the setting of parameters nor depends on specific biological expertise. Our approach does not require the imposition of an explicit probability distribution function for the flow cytometric data.
An important feature of the described methodology concerns the use of a multivariate approach, which has advantages over the currently used manual method of analysis based on the definition of regions in bi-dimensional plots for subjectively chosen pairs of parameters. Our approach is tailor-made for this very important application and takes advantage of well-established biological knowledge as a priori information. In addition, the prototype-based segmentation approach employed provides a user-friendly interpretation framework, since these vectors may be viewed as being representative of groups of events. The implemented method is computationally simple and can be extended to virtually any combination of antibodies, including those based on the analysis of stainings with four or more different fluorochromes.
ACKNOWLEDGMENT
E. S. Costa would like to thank Prof. Nelson Spector (Clinical Medicine Graduate Program/UFRJ) for his helpful support.
REFERENCES
[1] A. Orfao et al., “Useful information provided by the flow cytometric immunophenotyping of hematological malignancies: Current status and future directions,” Clin. Chem., vol. 45, pp. 1708–1717, 1999.
[2] E. Coustan-Smith et al., “Clinical importance of minimal residual disease in childhood acute lymphoblastic leukemia,” Blood, vol. 96, pp. 2691–2696, 2000.
[3] F. Li and Y. Yang, “Analysis of recursive gene selection approaches from micro-array data,” Bioinformatics, vol. 21, pp. 3741–3747, 2005.
[4] B. S. Edwards, T. Oprea, E. R. Prossnitz, and L. A. Sklar, “Flow cytometry for high-throughput, high-content screening,” Curr. Opin. Chem. Biol., vol. 8, pp. 392–398, 2004.
[5] E. S. Costa, M. E. Arroyo, C. E. Pedreira, M. A. García-Marcos, M. D. Tabernero, J. Almeida, and A. Orfao, “A new automated flow cytometry data analysis approach for the diagnostic screening of neoplastic B-cell disorders,” Leukemia, vol. 20, pp. 1221–1230, 2006.
[6] N. L. Harris et al., “The World Health Organization classification of neoplasms of the hematopoietic and lymphoid tissues: Report of the Clinical Advisory Committee meeting, Airlie House, Virginia, November 1997,” Hematol. J., vol. 1, pp. 53–66, 2000.
[7] M. L. Sanchez et al., “Incidence of phenotypic aberrations in a series of 467 patients with B chronic lymphoproliferative disorders: Basis for the design of specific four-color stainings to be used for minimal residual disease investigation,” Leukemia, vol. 16, pp. 1460–1469, 2002.
[8] M. Bellido, E. Rubiol, J. Ubeda, C. Estivill, O. Lopez, R. Manteiga, and J. Nomdedeu, “Rapid and simple immunophenotypic characterization of lymphocytes using a new test,” Haematologica, vol. 83, pp. 681–685, 1998.
[9] A. Gersho, “Asymptotically optimal block quantization,” IEEE Trans. Inf. Theory, vol. IT-25, no. 4, pp. 373–380, Jul. 1979.
[10] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.
[11] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: Springer, 2001.
[12] B. E. Schleiffenbaum, R. Ruegg, D. Zimmermann, and J. Fehr, “Early diagnosis of low grade malignant lymphoma and chronic lymphocytic leukaemia. Verification of morphologically suspected malignancy in blood lymphocytes by flow cytometry,” Eur. J. Haematol., vol. 57, pp. 341–348, 1996.
Carlos Eduardo Pedreira (SM’03) was born in Rio de Janeiro, Brazil, on April 11, 1956. He received the B.S. and M.Sc. degrees in electrical engineering (systems) from PUC-Rio, Rio de Janeiro, Brazil, in 1979 and 1981, respectively, and the Ph.D. degree from the Imperial College of Science Technology and Medicine, University of London, London, U.K., in 1987. He is currently an Associate Professor at the School of Medicine and the COPPE-PEE Engineering Graduate Program, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil. He was the Founding President of the Brazilian Neural Networks Society. His main research interests include pattern classification, cluster analysis, neural networks, and statistical methods for biomedical applications.
Elaine S. Costa was born in Rio de Janeiro, Brazil, in 1974. She received the B.S. degree in medicine and the M.S. and the Ph.D. degrees from the Federal University of Rio de Janeiro (UFRJ) in 1996, 2003, and 2006, respectively. She became a specialist in pediatric hematology in 2000. Currently, she is with the IPPMG Hospital, UFRJ. Her main research interest is multiparametric flow cytometry of neoplastic diseases.
M. Elena Arroyo was born on June 5, 1977. She received the M.D. degree from the University of Salamanca, Salamanca, Spain, in 2000. She became a Technician in the General Cytometry Service, University Hospital of Salamanca. She is currently a Product Specialist at Cytognos SL (a private cytometry products company).
Julia Almeida was born on June 14, 1964. She received the M.D. and Ph.D. degrees from the University of Salamanca, Salamanca, Spain, in 1988 and 1994, respectively. She became a Specialist in haematology and haemotherapy in 1994, at the University Hospital of Salamanca. She is currently a Professor of Immunology at the University of Salamanca, and a member of the research team led by A. Orfao at the Cancer Research Center of Salamanca.
Alberto Orfao was born on July 15, 1960. He received the M.D. degree from both the University of Salamanca, Salamanca, Spain, in 1984 and the Nova University of Lisbon, Lisbon, Portugal, in 1985, and the Ph.D. degree from the University of Salamanca in 1987. He is currently a Professor of immunology and the Director of the General Cytometry Service at the University of Salamanca, as well as a Principal Investigator at the Cancer Research Center of Salamanca. He has led the Spanish National DNA Bank since its creation in 2002. His main research interest is in translational medicine, mainly focused on hematological malignancies and the relationship between the immune system and cancer.