
Part. Part. Syst. Charact. 22 (2005) 107–118, DOI: 10.1002/ppsc.200400888

Multidimensional Pattern Recognition and Classification of White Blood Cells Using Support Vector Machines

Malek Adjouadi, Nuannuan Zong, Melvin Ayala*

(Received: 29 October 2003; in revised form: 8 February 2005; accepted: 10 May 2005)

Abstract

This study introduces a new algorithm to optimize the pattern recognition of different white blood cell types in flow cytometry. The behavior of parametric data clusters in a multidimensional space is analyzed using the learning system known as Support Vector Machines (SVM). Beckman-Coulter Corporation supplied flow cytometry data from numerous patients to be used as training and testing sets for the algorithm. Subsequently, the characteristics of the cells provided in these sets were used to train an SVM-based classifier. The objective in developing this algorithm was to identify the category of a given blood sample and provide information to medical doctors in the form of diagnostic references for a specific disease state, lymphocytic leukemia. With the application of the hypothesis space, the learning bias and the learning algorithm, the SVM classifier was successfully trained to evaluate misclassification ratios in flow cytometry data in an effort to recognize abnormal blood cell patterns and address the ubiquitous problem of data overlap through the use of the maximal margin classifier.

Keywords: data classification, lymphocytic leukemia, support vector machines, white blood cells

1 Introduction

White blood cells (WBC) refer to a family of cells that do not contain hemoglobin. Normal white blood cells include lymphocytes, neutrophils, eosinophils, basophils, and monocytes. These cells are produced by the bone marrow, help the body fight infection and other diseases, and are necessary for the normal functioning of the blood. Abnormal white blood cells, on the other hand, include blasts, immature granulocytes, and atypical lymphocytes. One method of determining that a medical abnormality may exist in the body of a patient is to note when the subpopulations of abnormal white blood cells exceed the acceptable range. The presence of unhealthy white blood cells leads to a host of complications such as deficiency of the immune system, coagulation problems, swollen lymph nodes, and other conditions. A blood cell subpopulation map has been proposed based on standard analysis of patient flow cytometry data.

Dr. M. Adjouadi, N. Zong, Dr. M. Ayala, Department of Electrical & Computer Engineering, Florida International University, 10555 W. Flagler Street, Miami, FL 33174 (USA). E-mail: [email protected]


This model is often regarded as an average case, representing the expected locations and types of blood cells that will appear whenever a blood sample is analyzed by a flow cytometer and displayed as a dot plot with the parameters of absorbance and volume. Based on this model, the normal and abnormal cell types and their locations are as shown in Figure 1. It can be observed that even when a typical case is represented in an absorbance vs. volume space, overlapping of regions does occur, which further complicates any classification task. For instance, lymphocytic leukemia is a condition characterized by an accumulation of abnormal lymphocytes in the blood and the bone marrow. These lymphocytes do not perform their functions as normal ones would and interfere with the production of other blood cells. This departure from the normal distribution of white blood cells serves as an indicator for the disease. The investigation began with the application of a promising concept referred to as Support Vector Machines (SVM) [1, 2] as an alternative classification approach to Gaussian approximations and to directional Voronoi applications [3, 4].


Fig. 1: Representation of normal and abnormal blood cell subpopulations as a function of their absorbance and volume. Lymphs, Monocytes, Eosinophils, and Neutrophils represent clusters of normal cells; the rest are abnormal cells. Region overlap in this representation is also noticeable (Courtesy of Beckman-Coulter Corporation).

An SVM is a supervised learning system that can quickly separate patterns into two categories, and can even deal with non-linearly separable data by transforming it into a higher dimensional feature space in which it becomes linearly separable. SVMs are trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. Encouraging preliminary results [5] guided us toward this more rigorous implementation approach.

2 Description of the Problem

Medically, the traditional diagnostic approach to determining whether blood disorders exist relies on blood cell counting. Lymphocytic leukemia, for example, is presently a common disease, with more than 10,000 new cases found each year in the U.S. alone. A lymphocytic leukemia patient's marrow makes too many blast cells (immature white blood cells). These blast cells should mature into healthy white blood cells called lymphocytes, but they do not. So many blast cells appear that the marrow can no longer produce suitable quantities of normal red blood cells, white blood cells, and platelets. Given this information on leukocyte subpopulations, it becomes necessary to analyze the physical characteristics associated with their parametric representations. Because of the extremely large number of blood cells that exist in a given sample, analysis is difficult using ordinary methods such as human visual inspection of blood cells under a microscope.


It is expected that, through the use of the SVM algorithm, it becomes possible to automate the process of blood cell counting. In the process of formulating a solution to the aforementioned problem, two different approaches were attempted. The first scenario uses the characteristics of flow cytometry blood cell data provided by Beckman-Coulter Corporation to train the SVM classifiers to successfully classify the regions within the data. Here, a region is defined as a cluster whose cells have approximately similar features. Other blood cell data would then be used as test data for the classifier. The second approach uses normal blood cell samples as well as abnormal samples from patients diagnosed with lymphocytic leukemia to train an SVM classifier; the objective is to accurately identify the abnormal samples. The major tasks undertaken were to properly format the data and to successfully train the algorithm. The data passed to the SVM for training purposes must be strictly formatted. The formatted data contains various features that affect the accuracy of the training results generated by the SVM classifier. The accuracy of the results is proportional to the data volume; in other words, the more accurate the results we want to obtain, the more data must be available for analysis. The performance of the developed algorithms was evaluated using the so-called manual reference files of Beckman-Coulter, which contain the definitive classification assessments of the blood cells.

3 Software Approach to Data Analysis

Each blood cell contained in a Beckman-Coulter data file (sample) is represented by specific parameters. In the data provided, each cell is described by 24 parameters, which are defined in Table 1. Since many of the details involved in the understanding and interpretation of flow cytometry are best appreciated visually, a specialized software program known as Winlist, with impressive display and analysis capabilities, is used. The software provides the means to create histograms, perform gating, generate regional analyses, and use color mapping. In Winlist, the term histogram refers to single-parameter displays; however, the software also allows for the generation of two- and three-dimensional plots. The two-dimensional plots, which show information for any combination of two parameters, are termed "dot plots" and are the most common type used. This type of display allows us to visualize two measured parameters on a single plot. Examples of dot plots using various sets of parameters are depicted in Figure 2.


Table 1: Parameter definition list.

Parameter number   Parameter name   Parameter definition
P1                 DC               Direct Current Impedance
P2                 OP               Opacity
P3                 RlsSoft          Rotate Light Scatter
P4                 Mals             Median Angle Light Scatter
P5                 Lmals            Low Median Angle Light Scatter
P6                 Umals            Upper Median Angle Light Scatter
P7                 Ls1              Low Angle Light Scatter
P8                 Ls2              Low Angle Light Scatter
P9                 Ls3              Low Angle Light Scatter
P10                Ss               Side Scatter Light Scatter
P11                Pmt1             Principle Fluorescence Sensor
P12                Pmt2             Principle Fluorescence Sensor
P13                Pmt3             Principle Fluorescence Sensor
P14                Pmt4             Principle Fluorescence Sensor
P15                LmalsLog         Logarithmic function of P5
P16                UmalsLog         Logarithmic function of P6
P17                Ls1Log           Logarithmic function of P7
P18                Ls2Log           Logarithmic function of P8
P19                Ls3Log           Logarithmic function of P9
P20                SsLog            Logarithmic function of P10
P21                Pmt1Log          Logarithmic function of P11
P22                Pmt2Log          Logarithmic function of P12
P23                Pmt4Log          Logarithmic function of P14
P24                MalsLog          Logarithmic function of P4

Fig. 2: Examples illustrating dot plots of List Mode Files using Winlist: (a) RlsSoft vs. DC (regions R1, R2, R3, R4); (b) OP vs. DC (regions R5, R6, R7); (c) OP vs. DC (region R8). Each cell subpopulation is represented with a different gray-scale color as converted from its original color representation.

4 Concept of SVM Algorithms

Unlike traditional classification methods, which minimize the empirical training error, SVM aims at minimizing an upper bound of the generalization error by maximizing the margin between the separating hyperplanes of the data. This can be regarded as an implementation of the structural risk minimization principle [6]. What makes SVM attractive are the properties of data expression analysis, including flexibility in choosing a similarity function, sparseness of the solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. Application examples are found in studies [7, 8]. The maximal margin classifier is the simplest optimization model used with SVM. The SVM algorithm finds the maximum margin hyperplane, which is the hyperplane that maximizes the minimum distance from the hyperplane to the closest training point. This is illustrated in Figure 3. The maximum margin hyperplane is thus represented as a linear combination of training points. That is, if two classes are linearly separable, there will be a set of weight vectors that gives rise to separating hyperplanes, which comprise the decision function in the input space. The functional margin γ of a training set S is defined as the maximum perpendicular distance between pairs of boundary hyperplanes.

Fig. 3: Illustration of the maximal margin hyperplane with its support vectors for a given training set.


Figure 3 depicts the boundary hyperplane as well as the functional margin γ for a sample set of data points: it shows the classification boundary (solid line) in a two-dimensional input space together with the accompanying margins (dotted lines). Positive and negative examples fall on opposite sides of the decision boundary, and the support vectors (highlighted) are the points lying closest to it. For simplicity, the margin value γ is initially chosen to be 1, so that for the positive class x⁺ and the negative class x⁻ one can write:

f(x^+) = w \cdot x^+ + b \geq 1    (1)

f(x^-) = w \cdot x^- + b \leq -1    (2)

where w is the weight vector. The geometrical margin of a data point is computed as the distance from that data point to the separating plane. Consequently, the problem of finding the optimum weight vector for the separating hyperplane is closely related to the maximization of the geometrical margin of the data points. These data points are termed support vectors and are the points from each class that lie closest to the separating plane. The problem of maximizing the geometrical margin can be expressed in two ways, known as the primal and the dual formulation. The dual representation is expressed by means of a Lagrangian function as defined in Eq. (3), and can be solved via an iterative mathematical technique known as quadratic programming [9]:

L(w, b, \alpha_i) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{n} \alpha_i [ y_i (w \cdot x_i + b) - 1 ]    (3)

The dual representation also makes use of constraint variables and defines the optimum weight vector w_opt as a linear combination of the support vectors, as shown in Eq. (4):

w_{opt} = \sum_{i \in I_{SV}} \alpha_i \, y_i \, x_i    (4)

where α_i ≥ 0 is a Lagrange multiplier defined as a non-negative constant associated with each data point, and I_SV is the set of indexes of the data points that are support vectors of the data set. The sum above can also be extended to all data points (i = 1, 2, ..., n), since the Lagrange multipliers are zero for all data points that are not support vectors. What makes the dual problem formulation so attractive is that its solution does not depend on the dimensionality of the data points but rather on the number of samples used. Having computed the optimum weight vector w_opt, the optimum hyperplane bias b_opt can be computed using either of the reference equations below:

w_{opt} \cdot x^+_{sv} + b_{opt} = 1 \quad \text{or} \quad w_{opt} \cdot x^-_{sv} + b_{opt} = -1    (5)

where x⁺_sv and x⁻_sv are any positive or negative support vector (i.e., vectors from the positive or the negative class), respectively, and where (·) denotes the scalar product. With the optimum weight vector w_opt and the optimum bias b_opt determined, the classification of a multidimensional data point x can be performed using the following equation:

f(x) = w_{opt} \cdot x + b_{opt} = \sum_{i=1}^{n} \alpha_i \, y_i \, (x_i \cdot x) + b_{opt}    (6)

If the function output f(x) is positive, then x is classified in the positive class, and vice versa. It can be noted in Eq. (6) that the decision function for classifying points with respect to the hyperplane only involves inner products between multidimensional data points represented as vectors. Furthermore, the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space. Thus, an SVM can locate a separating hyperplane in the feature space and classify points in that space without ever representing the space explicitly, simply by defining a function, called a kernel function [10], that plays the role of the dot product in the feature space. This technique avoids the computational burden of explicitly representing the feature vectors [2]. Different kernels give the SVM the ability to map and train the data in various feature spaces:
1. Linear kernel: the linear kernel remains the best choice for linearly separable sets of data, since no transformation of the input space into the feature space is required.
2. Radial basis function (RBF): this is a Gaussian transformation, where the kernel uses a uniform sigma for all feature components.
3. Polynomial: this provides a polynomial transformation of the space, which is equivalent to a polynomial separating surface in the input space.
4. Sigmoid: this provides an exponential transformation of the input space.
Another appealing feature of SVM-based classification is the sparseness of its representation of the decision boundary. The location of the separating hyperplane in the feature space is specified via real-valued weights on the training set examples. Those training examples that lie far away from the hyperplane do not participate in its specification and therefore receive weights of zero; only the training examples that lie close to the decision boundary between the two classes receive non-zero weights. Support vectors are so called because removing them would change the location of the separating hyperplane. The support vectors in a two-dimensional feature space are as illustrated earlier in Figure 3.


The SVM learning algorithm is defined so that, in a typical case, the number of support vectors is small compared to the total number of training examples. This property allows the SVM to classify new examples efficiently, since the majority of the training examples can be safely ignored. In essence, the SVM focuses on the small subset of examples that are critical to differentiating between class members and non-class members, throwing out the remaining examples. This is a crucial property when analyzing large data sets containing many uninformative patterns, as is the case in many data mining problems. SVMs effectively remove the uninformative patterns from the data set by assigning them weights of zero.
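To make Eq. (6) and the kernel idea concrete, the short sketch below evaluates the decision function f(x) = Σ_i α_i y_i K(x_i, x) + b for an already trained set of multipliers, with the four kernel types listed above written out as plain functions. This is an illustrative Python/NumPy sketch rather than the authors' MATLAB implementation; the function names and default parameter values are assumptions.

```python
import numpy as np

# Kernel functions corresponding to the four types listed above.
def linear_kernel(u, v):
    return np.dot(u, v)

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(u, v, degree=2, c=1.0):
    return (np.dot(u, v) + c) ** degree

def sigmoid_kernel(u, v, kappa=0.01, theta=0.0):
    return np.tanh(kappa * np.dot(u, v) + theta)

def svm_decision(x, X_train, y_train, alphas, b, kernel=linear_kernel):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b  (cf. Eq. (6))."""
    f = b
    for x_i, y_i, a_i in zip(X_train, y_train, alphas):
        if a_i > 0:                       # only support vectors contribute
            f += a_i * y_i * kernel(x_i, x)
    return f

def classify(x, *args, **kwargs):
    """Threshold the real-valued output: +1 class if f(x) > 0, else -1."""
    return 1 if svm_decision(x, *args, **kwargs) > 0 else -1
```

Swapping the kernel argument changes the feature space in which the separating hyperplane is implicitly sought, without changing any other part of the computation.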


5 Algorithm Used for the SVM Implementation


Figure 4 illustrates the general structure of the algorithm as implemented in this study. The computation of the support vectors was performed with a gradient descent method programmed in MATLAB. The starting point is a simple rule for learning algorithms that is based on the dot product between inputs and targets. In this approach, the Lagrange multipliers (α_i ≥ 0) are updated using Eq. (7) whenever the inequality Output_i ≠ Target_i is satisfied:

\alpha_i = \alpha_i + \Delta\alpha_i    (7)

Since the inequality Output_i ≠ Target_i also holds when y_i · Output_i ≤ 0, and taking into account that Output_i = f(w · x_i + b) while replacing w by \sum_{p=1}^{n} \alpha_p \, y_p \, x_p, one obtains the final update rule for each sample i:

\text{if } y_i \left( \sum_{p=1}^{n} \alpha_p \, y_p \, (x_p \cdot x_i) + b \right) \leq 0 \ \text{ then } \ \alpha_i = \alpha_i + \Delta\alpha_i    (8)

The Lagrange multiplier α_i is updated for each sample in the data set. The value of Δα_i depends on the error for each data sample and on the desired convergence rate, and is computed according to Eq. (9):

\Delta\alpha_i = \frac{1}{x_i \cdot x_i} \left[ 1 - y_i \sum_{p=1}^{n} \alpha_p \, y_p \, (x_p \cdot x_i) \right]    (9)
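A minimal sketch of the update rule of Eqs. (7)-(9) is given below, written in Python/NumPy rather than the authors' MATLAB code; the variable names and the stopping criterion are assumptions. It uses the plain dot product (linear kernel) and a fixed bias b, and is meant only to show how the multipliers α_i are driven by the misclassification condition of Eq. (8).

```python
import numpy as np

def train_svm_multipliers(X, y, b=0.0, max_iter=1000):
    """
    Update the Lagrange multipliers alpha_i following Eqs. (7)-(9):
    whenever y_i * (sum_p alpha_p * y_p * (x_p . x_i) + b) <= 0, increase alpha_i
    by delta_alpha_i = [1 - y_i * sum_p alpha_p * y_p * (x_p . x_i)] / (x_i . x_i).
    X: (n, d) array of data points; y: (n,) array of +1 / -1 targets.
    """
    n = X.shape[0]
    alphas = np.zeros(n)
    K = X @ X.T                                   # Gram matrix of dot products x_p . x_i
    for _ in range(max_iter):
        updated = False
        for i in range(n):
            s = np.sum(alphas * y * K[:, i])      # sum_p alpha_p y_p (x_p . x_i)
            if y[i] * (s + b) <= 0:               # misclassification condition, Eq. (8)
                delta = (1.0 - y[i] * s) / K[i, i]          # Eq. (9)
                alphas[i] = max(alphas[i] + delta, 0.0)     # Eq. (7), keep alpha_i >= 0
                updated = True
        if not updated:                           # no sample violates the condition
            break
    w_opt = (alphas * y) @ X                      # Eq. (4): w_opt = sum_i alpha_i y_i x_i
    return alphas, w_opt
```

With the multipliers in hand, the optimum bias can then be recovered from any support vector through Eq. (5).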

Initially, the original data as obtained from Beckman-Coulter Corporation is formatted to extract the features given in Table 1. This formatted data is used to train the Support Vector Machine, where the parameters α_i and b are obtained. The testing data, which consists of different cell subpopulations from the same blood sample, is also formatted and sent to the trained SVM for classification. The results of the testing allow for the final labeling of the subpopulations.


Fig. 4: The implementation procedure of the algorithm: the original data sample is formatted, the formatted training data are used to train the Support Vector Machine (yielding the parameters α_i and b), and the formatted testing data are then classified with the trained SVM, which outputs the class labels (results).

6 Verification of the Regions

The information obtained from the flow cytometry data on blood cells is divided into two sets of parameters: principal parameters and derived parameters. The prototype flow cytometer collects the principal parameters, while the derived parameters are obtained from different combinations of the principal parameters. According to the parameters generated from the flow cytometric experiments, and in reference to Figure 1, the white blood cell subpopulations can be grouped into different clusters as shown in Figure 5. It is noted that for a normal sample the subpopulation of R1 is about 6%, R2 is 2%, R3 is 12%, R4 is 18%, R5 is 14%, and R6, R7 and R8 together account for about 1%. In total, 53% of the cells lie in non-overlapping regions; on average, 47% of the cells fall into the overlapping regions. Since the SVM classifier is binary, meaning it can only separate two classes at a time, the training data are assigned to the +1 class or the −1 class. Having obtained the features of the blood cells, the training of the classifier may commence. At the beginning of the training process, the classifier must be initialized with the particular kernel, the kernel parameters, and the dimensions of the training array; additionally, the upper limit on the alpha terms must be chosen. Once the classifier is initialized, it can be trained by providing it with the training array and the mapping vector.


As the algorithm processes the rows of the training array, it learns to recognize the particular characteristics of the class indicated by the mapping vector. Once the classifier is trained, data extracted from the test blood cells can be fed into it. The output of this step is a prediction result for the test cells. This prediction vector indicates to which class the classifier understands that a particular cell belongs. In order to determine the success rate of the classification effort, it must be known in advance to which class each cell belongs. Then, by comparing the predictions to the expectations, a ratio of misclassification can be obtained by dividing the number of misclassifications by the total number of cells in the particular region. For this study, 50 samples, with each patient represented by a single sample, have been considered. Each sample includes the 24 parameters defined earlier in Table 1. The format of the data given for each blood cell recorded in a flow cytometry file is illustrated in Table 2.

Table 2: Data format of white blood cells.

Cell No.   DC    OP    RlsSoft   ...   Pmt4Log   MalsLog   Region value (Rval = 2^(R#-1))   Region number or label (R#)
1          46    37    31        ...   52        38        1                                R1
2          43    36    28        ...   51        35        1                                R1
3          12    31    31        ...   12        32        4                                R3
4          40    43    40        ...   60        46        8                                R4
5          21    34    31        ...   45        33        4                                R3
...        ...   ...   ...       ...   ...       ...       ...                              ...

The last column, "Region number or label", represents the region R# (either as a label or as a number) to which each cell should belong. This last column is not supplied with the data file; rather, it is filled in such that relation (10) is satisfied:

R_{val} = 2^{R\# - 1}    (10)

The last column was used as a target during the training phase. The region assignments following this rule are given explicitly in Table 3.

Table 3: Rules for region definition.

Region label (R#)   Region value (Rval = 2^(R#-1))
R1                  2^0 = 1
R2                  2^1 = 2
R3                  2^2 = 4
R4                  2^3 = 8
R5                  2^4 = 16
R6                  2^5 = 32
R7                  2^6 = 64
R8                  2^7 = 128

Fig. 5: Representation of the white blood cell subpopulations in the absorbance vs. volume chart, with region labels included. Region R8 is not shown since it has a small number of cells (Courtesy of Beckman-Coulter Corporation).

Based on the definition of the regions, many cells may fall in the overlapping areas between two regions; for such cells, inverting relation (10) (i.e., computing R# = log2(Rval) + 1) would yield non-integer values for the region number. To account for the overlapping areas and still avoid non-integer region assignments, Beckman-Coulter applies a procedure that decomposes the region value into powers of two. For example, if a cell has a region value Rval of 5, which is not a power of 2, the value can be decomposed as 5 = 1 + 4 = 2^(1-1) + 2^(3-1); from this decomposition, it can be concluded that the cell lies in the intersection of regions R1 and R3. Similarly, a region value Rval of 68 can be decomposed as 68 = 4 + 64 = 2^(3-1) + 2^(7-1), meaning that the cell appears in the intersection of regions R3 and R7, and so on. Since SVMs can only classify a two-category type of data, the classifier was trained pairwise, meaning that all possible combinations of 2 regions were analyzed separately.
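The power-of-two decomposition of the region value reduces to reading off the set bits of Rval. A small sketch (illustrative only, not Beckman-Coulter's implementation):

```python
def decode_region_value(r_val):
    """
    Decompose a region value built from Rval = 2**(R# - 1) into its regions:
    e.g. 5 = 1 + 4 -> {R1, R3}, 68 = 4 + 64 -> {R3, R7}, 8 -> {R4}.
    """
    regions = []
    for r_number in range(1, 9):            # regions R1 .. R8
        if r_val & (1 << (r_number - 1)):   # bit (R# - 1) set: cell belongs to R#
            regions.append(f"R{r_number}")
    return regions

# Examples quoted in the text:
assert decode_region_value(5) == ["R1", "R3"]
assert decode_region_value(68) == ["R3", "R7"]
assert decode_region_value(8) == ["R4"]
```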

7 Implementation and Results for Region Classification

For this study, the same training and testing procedure was repeated for all 50 patients (50 samples), regardless of whether the patient was sick or healthy; in fact, there was no prior knowledge of which samples were healthy and which were not.


Each patient is represented by a single sample containing all 24 parameters as defined in Table 1, and each sample contains 12,288 white blood cells of different types. For illustrative purposes, the results shown in this section correspond to file 9F0BG00A.txt, which contains the normal sample of one particular patient. The regional statistics of the cells are provided in Table 4.

Table 4: Number of cell subpopulations contained in each region.

Region number         Cell population
R1                    771 cells
R2                    232 cells
R3                    1450 cells
R4                    2186 cells
R5                    1269 cells
R6                    1757 cells
R7                    74 cells
R8                    43 cells
Overlapping regions   4506 cells

The first 100 cells of each region were chosen as the data needed to train the support vector machines; the remaining cells of the non-overlapping regions, as well as the cells in the overlapping regions, were then used to test the classifier. The cell populations shown above were used as ground truth to compute the percentage of misclassification. As can be noted from these cell populations, the numbers of cells in regions R7 and R8 are less than 100. Thus, in this study, the classification was performed only for cells belonging to the first 6 regions for the specific data files that were available. Even though the classification seems to involve several classes, it was possible to simplify the problem by using binary classification. It was also considered that the flow cytometer already performs a multi-category region classification by providing region values for each cell; the region or regions to which each cell should belong can be inferred from its region value as indicated earlier. The classification performed in this study is done for those cells that could not be unambiguously classified by the flow cytometer. Fortunately, the maximum classification uncertainty produced by the flow cytometer consisted of assigning a cell to an overlap of no more than two regions; therefore, binary classification was chosen as the most appropriate way of dealing with the data provided. In both the non-overlapping and the overlapping cases, the regions can easily be obtained by applying relation (10) and the power-of-two decomposition described above. The number of possible intersections of 2 out of the 6 regions can be computed as C(6,2) = 15; potentially, there are thus 15 areas where two regions can overlap. Accordingly, 15 different SVMs, denoted SVM_R1∩R2, SVM_R1∩R3, ..., SVM_R5∩R6, were trained to classify the cells into one out of two possible regions.
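Enumerating the C(6,2) = 15 region pairs, one binary classifier per pair, is straightforward; a small sketch (the classifier naming is illustrative):

```python
from itertools import combinations

region_labels = ["R1", "R2", "R3", "R4", "R5", "R6"]

# All unordered pairs of the six regions; one binary SVM is trained per pair.
pairs = list(combinations(region_labels, 2))
assert len(pairs) == 15

classifier_names = [f"SVM_{a}_{b}" for a, b in pairs]  # e.g. "SVM_R1_R2", ..., "SVM_R5_R6"
```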

According to the information contained in the raw data, a binary classification learning method greatly increases classification performance, since it reduces the risk of assigning a cell to a region that is not in the corresponding overlapping area. A binary classification approach both simplifies the method and fully accounts for the constraints regarding the region value to which each cell should belong. The integrated training/testing procedure, sketched in code below, is implemented as follows: (1) the first 100 cells from the non-overlapping regions are taken to train each of the 15 classifiers; (2) the remaining cells of these regions, as well as the cells in the overlapping regions that were not used in the training, are classified with the trained SVMs; (3) the classification error is computed by comparing the classifier output with the targets contained in the raw data (Table 2) as well as with the true classifications contained in the manual reference files. For example, classifier SVM_R1∩R2 is trained using the first 100 cells of regions R1 and R2. Then, the remaining cells of those regions, as well as those in the overlapping area with region value Rval = 3 (cells in the intersection of regions R1 and R2) that were not used in the training, are classified. The output of classifier SVM_R1∩R2 is compared with the targets (R1 or R2 for these cells in Table 2) as well as with the true classifications contained in the manual reference file in order to obtain the classification error of the classifier.
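The three steps above can be put together for a single pair of regions as in the following sketch; the train_fn/predict_fn helpers stand in for the SVM routines sketched earlier, the data loading is assumed, and only the 100-cell training split and the error computation mirror the description in the text.

```python
import numpy as np

def misclassification_ratio(predictions, true_labels):
    """Number of misclassified cells divided by the total number of cells."""
    predictions = np.asarray(predictions)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predictions != true_labels))

def train_and_test_pair(cells_a, cells_b, overlap_cells, overlap_truth,
                        train_fn, predict_fn, n_train=100):
    """
    cells_a, cells_b: feature arrays of cells known to lie in regions Ra and Rb;
    overlap_cells:    cells whose region value places them in the Ra/Rb overlap;
    overlap_truth:    their true labels (+1 for Ra, -1 for Rb) from the manual
                      reference file.
    """
    # Step 1: the first 100 cells of each non-overlapping region form the training set.
    X_train = np.vstack([cells_a[:n_train], cells_b[:n_train]])
    y_train = np.hstack([np.ones(n_train), -np.ones(n_train)])
    model = train_fn(X_train, y_train)

    # Step 2: classify the remaining cells and the cells in the overlapping area.
    X_test = np.vstack([cells_a[n_train:], cells_b[n_train:], overlap_cells])
    y_true = np.hstack([np.ones(len(cells_a) - n_train),
                        -np.ones(len(cells_b) - n_train),
                        overlap_truth])
    y_pred = predict_fn(model, X_test)

    # Step 3: compare the predictions with the reference classifications.
    return misclassification_ratio(y_pred, y_true)
```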

Fig. 6: Representation of regions R1 (o) and R4 (*).


Misclassification Ratios: R1 vs. R4
An initial training attempt was made using the data from regions R1 and R4, as shown in Figure 6. The optimization of the weight vectors was completed after 72 iterations, and the number of support vectors needed was found to be 5. After training, it was found that the percentage of cells that belong to R1 but were misclassified as belonging to R4 is 0.30%; likewise, the percentage of cells that belong to R4 but were misclassified as belonging to R1 is 0.14%. The largest misclassification is thus still much less than 1%, which is regarded as rather insignificant given the extent of the data.

Misclassification Ratios: R1 vs. R3
The analysis of regions R1 and R3, as shown in Figure 7, yielded different misclassification rates. The percentage of cells that belong to R1 but were misclassified as belonging to R3 was 1.19%; on the other hand, the percentage of cells that belong to R3 but were misclassified as belonging to R1 is 1.04%. In this training example the misclassification percentage is slightly higher, but it is still well within tolerance limits. The optimization of the weight vectors was completed after 101 iterations, and the number of support vectors was found to be 6 in this case.

Fig. 7: Representation of regions R1 (o) and R3 (*).

Misclassification Ratios: R3 vs. R4
For this pair of regions, the percentage of cells which should belong to R3 but were misclassified as belonging to R4 was 0%, while the percentage of cells which should belong to R4 but were misclassified as belonging to R3 was 0.05%. The optimization of the weight vectors was completed after 21 iterations, and the number of support vectors was found to be 4 for this classification case, as illustrated in Figure 8.

Fig. 8: Representation of regions R3 (o) and R4 (*).

Misclassification Ratios: R5 vs. R6
For these two regions, shown in Figure 9, the percentage of the cells which should belong to R5 but were misclassified as belonging to R6 was 0.34%; similarly, the percentage of the cells which should belong to R6 but were misclassified as belonging to R5 was 0.18%. The optimization of the weight vectors was completed after 20781 iterations, and the number of support vectors was found to be 14.

Fig. 9: Representation of regions R5 (o) and R6 (*).

In retrospect, and following the same methodology, the classification between pairs of different regions was continued until all the results shown in Table 5 were obtained.


Table 5: Classification ratios of regions for file 9F0BG00A.txt, expressed in percentages. Each entry expresses the percentage of cells that actually belong to region Ra but were classified as belonging to region Rc. For example, the value of 1.19 in the 3rd column and 1st row means that 1.19% of the cells belonging to region R1 were misclassified as belonging to region R3.

                     Classified as (Rc)
Actual region (Ra)   R1      R2       R3      R4      R5      R6
R1                   98.51   0        1.19    0.30    0       0
R2                   0       100.00   0       0       0       0
R3                   1.04    0        98.44   0       0.30    0.22
R4                   0.14    0.19     0.05    99.62   0       0
R5                   0       0.09     0.02    0       99.55   0.34
R6                   0       0        0.06    0.17    0.18    99.59

The same procedure was applied for all the other samples, and comparably good results were obtained throughout.

8 Classification of Abnormal Versus Normal Samples

There are 50 sets of data that were considered in this study. Thirty subjects diagnosed with lymphocytic leukemia were analyzed as the abnormal samples. Specifically, the 20 normal patient data sets were designated as belonging to the −1 class and the 30 abnormal patient data sets were designated as belonging to the +1 class.

In Figure 10, an abnormal sample with lymphocytic leukemia is shown (file 01D44BA2.LMD), containing 8192 cells in a two-dimensional space. It can be observed that this particular patient's marrow produces too many immature lymphocytes; in fact, lymphocytic leukemia is manifested by a progressive accumulation of these cells in the blood. The twenty normal samples were run as normal flow cytometry control data and have also been analyzed using this approach. For this particular classification approach, prior to determining whether a sample is normal or abnormal, a decision had to be made as to the selection of the parameters to be used in the data classification process. This step was necessary due to the specific nature of the samples provided in this study: each sample is given as a matrix rather than as a multidimensional vector, so a reduction of dimensionality was an indispensable condition. The first step in this direction was to consider only three of the 24 available parameters. The parameters chosen, considered as the main parameters, are DC, OP and RlsSoft; the remaining 21 parameters are known to have been generated by the flow cytometer as functions of these three main parameters. The conceptual reason for doing so is the assumption that these 3 parameters are the principal components, whereas the other parameters are derivatives formulated based on different functions of the principal parameters.

Fig. 10: Illustration of a lymphocytic leukemia sample, 01D44BA2.LMD. Each cell subpopulation is represented with a different gray-scale color as converted from its original color representation.

Following the considerations stated above, the initial subject samples were thus reduced to matrices of size 8192 × 3. However, representing such a matrix in a multidimensional space still requires as many dimensions as there are elements in the matrix, in this case 24,576. Obviously, a classification problem with this many dimensions could not possibly be considered given the heavy computational requirements. The practical solution to circumvent this problem was to extract key features of each of the 8192 × 3 matrices and use those features as the new dimensions of the samples. Feature extraction is a practical solution that drastically reduces the size of data sets. In the present case, different statistical parameters were analyzed for possible use in representing the histogrammed data. Based on their variability, only 5 parameters were chosen, namely the mean, peak, standard deviation, skewness and kurtosis. These parameters were computed from histogram representations of each of the three main parameters. Using these histograms, the 5 aforementioned standard features are defined as follows: the mean of the sample,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i ;

the peak, defined as the maximum value of the sample; the sample standard deviation,

\sigma = \left( \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{1/2} ;

the skewness (or degree of asymmetry),

y = E[(x_i - \bar{x})^3] / \sigma^3 ;

and the kurtosis (or convexity),

k = E[(x_i - \bar{x})^4] / \sigma^4 ,

where E defines the expected value. Therefore, the input space was created by presenting each sample as a 5 × 3 matrix. This idea is shown in Table 6. The number of elements in this matrix, in this case 15, is an acceptable number of dimensions that can be handled by any classification task. Thus, each subject sample is now handled as a 15-dimensional vector.

Table 6: Feature extraction of the statistical parameters mean, peak, standard deviation, skewness, and kurtosis applied to the histogrammed DC, OP, and RlsSoft parameters, creating a compressed set of 15 parameters for each data sample.

                     DC              OP              RlsSoft
Mean                 Mean (DC)       Mean (OP)       Mean (RlsSoft)
Peak                 Peak (DC)       Peak (OP)       Peak (RlsSoft)
Standard Deviation   STD (DC)        STD (OP)        STD (RlsSoft)
Skewness             Skewness (DC)   Skewness (OP)   Skewness (RlsSoft)
Kurtosis             Kurtosis (DC)   Kurtosis (OP)   Kurtosis (RlsSoft)

Figures 11 (a)-(c) show the histograms obtained by applying the 3 main parameters to the sample file. These 3 histograms are constructed from the 8192 DC, OP, and RlsSoft sample values from a single patient file; similar histograms are constructed for each patient.

Fig. 11: Histograms of the three main parameters as obtained from the sample file: (a) histogram of the DC parameter; (b) histogram of the OP parameter; (c) histogram of the RlsSoft parameter.
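A sketch of the 8192 × 3 to 15-feature reduction described above, assuming Python with NumPy and SciPy (the helper name and the column ordering are assumptions):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def extract_features(sample):
    """
    sample: array of shape (8192, 3) whose columns are DC, OP and RlsSoft.
    Returns the 15-element vector of Table 6: mean, peak, standard deviation,
    skewness and kurtosis of each of the three main parameters.
    """
    features = []
    for column in sample.T:                    # DC, OP, RlsSoft in turn
        features.extend([
            np.mean(column),                   # sample mean
            np.max(column),                    # peak (maximum value of the sample)
            np.std(column, ddof=1),            # sample standard deviation
            skew(column),                      # degree of asymmetry
            kurtosis(column, fisher=False),    # kurtosis as E[(x - mean)^4] / sigma^4
        ])
    return np.array(features)
```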

9 Implementation and Results for the Classification of Abnormal Samples

Since there are limited abnormal data sets available to train the system, using the leave-one-out method is an effective way to train the algorithm. The leave-one-out method examines the pattern recognition performance on each individual data vector by removing it from the complete training set and examining the removed vector as if it were a new test vector. This approach maximizes the use of the available training data, avoids the favorable bias that occurs when the vector being classified is included in the training, and examines the performance on all available data. The five features (mean, peak, standard deviation, skewness and kurtosis) extracted from the histogram of each of the 3 parameters (DC, OP and RlsSoft) are used as inputs to the SVM classifier.
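A minimal leave-one-out loop over the 50 feature vectors might look as follows; train_fn and predict_fn are generic placeholders for the SVM training and prediction routines, and the accuracy definition is an assumption:

```python
import numpy as np

def leave_one_out_accuracy(X, y, train_fn, predict_fn):
    """
    X: (50, 15) array of per-sample feature vectors;
    y: (50,) array of targets (+1 abnormal, -1 normal).
    Each sample is held out once, the classifier is trained on the other 49,
    and the held-out vector is classified as if it were a new test vector.
    """
    n = len(y)
    correct = 0
    for i in range(n):
        keep = np.arange(n) != i              # leave sample i out of the training set
        model = train_fn(X[keep], y[keep])
        prediction = predict_fn(model, X[i:i + 1])[0]
        if prediction == y[i]:
            correct += 1
    return correct / n
```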

Table 7: Data format after feature extraction. During training, each blood sample's training pattern consists of these 15 parameters and a codified target of −1 or +1 depending on whether the blood sample status is normal or abnormal, respectively.

                     DC        OP        RlsSoft
Mean                 1885.73   1120.51   561.37
Peak                 4095.00   3423.00   4095.00
Standard Deviation   706.95    457.57    407.64
Skewness             -0.58     -0.43     1.60
Kurtosis             2.88      2.49      9.49

Table 8: Results of sample classification expressed via a confusion matrix, showing the amount of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP).

                               Classified as Normal   Classified as Abnormal
Actual Normal (Total = 20)     TN = 19                FP = 1
Actual Abnormal (Total = 30)   FN = 4                 TP = 26

After the features described above were obtained for all 50 data sets available for this study, the algorithm depicted earlier in Figure 4 was applied to the data sets. The classification results shown in Table 8 are listed following the format used in Receiver Operating Characteristic (ROC) analysis [11]. The results in Table 8 show that, of the 20 normal patients, only one set was misclassified, yielding an accuracy of 95.00%, while for the 30 abnormal sets, 4 sets were misclassified, yielding an accuracy of 86.67%. Using ROC terminology, the following indicators evaluate the degree of accuracy achieved when applying the SVM: with TN = 19, FP = 1, FN = 4, and TP = 26, a true positive rate TP/(TP+FN) of 86.67% and a false positive rate FP/(FP+TN) of 5.00% are achieved. Given the subtle behavior of data clusters in flow cytometry data, compounded with the ubiquitous problem of data overlap, these results were most encouraging at this stage of the algorithm development process.
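The ROC-style indicators quoted above follow directly from the confusion matrix in Table 8; a quick check of the arithmetic:

```python
# Confusion matrix entries from Table 8.
TN, FP, FN, TP = 19, 1, 4, 26

true_positive_rate = TP / (TP + FN)    # 26 / 30 = 0.8667 -> 86.67%
false_positive_rate = FP / (FP + TN)   # 1 / 20  = 0.0500 ->  5.00%
accuracy_normal = TN / (TN + FP)       # 19 / 20 = 0.9500 -> 95.00%
accuracy_abnormal = TP / (TP + FN)     # 26 / 30 = 0.8667 -> 86.67%

print(true_positive_rate, false_positive_rate)
```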

10 Conclusion

The results show that SVM application is a powerful tool in the field of object classification and pattern recognition. The added advantage of being able to improve results by retraining the classifier means that an adaptive approach is realizable. Enhanced validation of the process lies in the acquisition of additional patient data for additional training of the classifier.

Throughout the algorithm implementation and testing processes, it was noted that the number of support vectors is a very important parameter: the misclassification ratio increases as the number of support vectors increases. It was expected that the number of support vectors needed to separate subclasses in these data sets would be small, due to the fact that each data set has been efficiently represented by a lower-dimensional set of features (of dimension 5 × 3). If the data had been left in its original state (initially requiring dimensions of 8192 × 3), the number of support vectors required to separate subclasses in these data sets would have been prohibitively large, leading to failures in algorithm convergence. It was determined that the effect of requiring small numbers of support vectors for pattern classification is twofold: 1) it reduces the algorithm's computation time, since smaller coefficient matrices are required for the associated numerical operations; 2) it increases the convergence rate as the misclassification ratio associated with the separated classes of data decreases. The first approach, which consisted of performing region classification of white blood cells, considered 30 samples each consisting of 12,288 cells, including healthy samples and lymphocytic leukemia samples. In this first approach, there was no knowledge of which samples were healthy and which were not. The results in this case yielded an accuracy of 98% or higher, meaning that there was less than a 2% chance for a cell belonging to a given region to be misclassified as belonging to another region. The second approach, which considered another set of 50 samples with each sample consisting of 8,192 cells, addressed the task of discriminating 20 healthy samples from 30 lymphocytic leukemia samples as predefined by Beckman-Coulter Corporation. In this second approach, the results yielded a true positive rate of close to 87% (meaning that, of the 30 abnormal sets, 4 sets were misclassified as normal) and a false positive rate of 5.00% (meaning that, of the 20 normal sets, 1 set was misclassified as abnormal). All samples were provided courtesy of Beckman-Coulter Corporation. The number of cells per sample, the formatting, and the dimensionality of the data were used in this algorithm as initially preset.

11 Acknowledgments

This research was supported by the National Science Foundation Grants EIA-9906600, HRD-0317692, CNS 042615, and the Office of Naval Research Grant N00014-99-1-0952. The support of Beckman-Coulter is greatly appreciated.


12 Nomenclature

b        bias
b_opt    optimum hyperplane bias
E        expected value
f(x)     real-valued function before thresholding
k        kurtosis
L        primal Lagrangian
log2     logarithm to the base 2
Ra       actual region
Rc       classified region
Rval     region value
R#       region number or label
w        weight vector
w_opt    optimum weight vector
x        data point in input space
x̄        sample mean
x⁺_sv    positive support vector
x⁻_sv    negative support vector
y        skewness
y_i      output in output space
α_i      dual variables or Lagrange multipliers
γ        margin
σ        sample standard deviation
‖·‖      norm

13 References

[1] V. N. Vapnik, A. J. Chervonenkis, Theory of Pattern Recognition. Nauka, Moscow, 1974.
[2] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer, New York, 1995.
[3] C. Reyes, M. Adjouadi, A Directional Clustering Technique for Random Data Classification. J. Cytometry, 1997, 27, 126-135.
[4] M. Adjouadi, C. Reyes, P. Vidal, A. Barreto, An Analytical Approach to Signal Reconstruction Using Gaussian Approximations Applied to Randomly Generated Data and Flow Cytometric Data. IEEE Transactions on Signal Processing, 2000, 48, 2839-2849.
[5] N. Zong, M. Adjouadi, Multidimensional Pattern Recognition and Classification of White Blood Cells Using Support Vector Machines. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, USA, July 27-30, 2003, pp. 101-106.
[6] I. Guyon, V. Vapnik, B. Boser, L. Bottou, S. Solla, Structural Risk Minimization for Character Recognition. Advances in Neural Information Processing Systems, 1992, Vol. 4, Morgan Kaufmann, Denver.
[7] H. Drucker, D. Wu, V. N. Vapnik, Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 1999, 10, 5.
[8] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, Jr., D. Haussler, Knowledge-based Analysis of Microarray Gene Expression Data Using Support Vector Machines. Proceedings of the National Academy of Sciences, 2000, 97, 262-267.
[9] C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998, 2, 121-167.
[10] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[11] J. Tilbury, P. Eetvelt, J. Garibaldi, J. Curnow, E. Ifeachor, Receiver Operating Characteristic Analysis for Intelligent Medical Systems: A New Approach for Finding Confidence Intervals. IEEE Transactions on Biomedical Engineering, 2000, 47, 952-963.
