Artif Life Robotics (2009) 14:16–19 DOI 10.1007/s10015-009-0712-z
© ISAROB 2009
ORIGINAL ARTICLE Mohd Saberi Mohamad · Sigeru Omatu · Safaai Deris Michifumi Yoshioka
Particle swarm optimization for gene selection in classifying cancer classes
Received and accepted: April 2, 2009
Abstract The application of microarray data for cancer classification has recently gained in popularity. The main problem that needs to be addressed is the selection of a small subset of genes from the thousands of genes in the data that contribute to a disease. This selection process is difficult due to the availability of a small number of samples compared with the huge number of genes, many irrelevant genes, and noisy genes. Therefore, this article proposes an improved binary particle swarm optimization to select a near-optimal (small) subset of informative genes that is relevant for the cancer classification. Experimental results show that the performance of the proposed method is superior to the standard version of particle swarm optimization (PSO) and other previous related work in terms of classification accuracy and the number of selected genes. Key words Gene selection · Hybrid approach · Microarray data · Particle swarm optimization
1 Introduction Microarray technology is a device that can be employed in measuring expression levels of thousands of genes simultaneously. It finally produces microarray data that contain information of biological, diagnostic, and prognostic use for researchers.1 Thus, there is a need to select informative genes that contribute to a cancerous state. However, the
M.S. Mohamad (*) · S. Omatu (*) · M. Yoshioka Department of Computer Science and Intelligent Systems, Graduate School of Engineering, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan e-mail:
[email protected];
[email protected] M.S. Mohamad · S. Deris Department of Software Engineering, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johore, Malaysia This work was presented in part at the 14th International Symposium on Artificial Life and Robotics, Oita, Japan, February 5–7, 2009
gene selection process poses a major challenge because of the following characteristics of microarray data: the huge number of genes compared with the small number of samples (high-dimensional data), irrelevant genes, and noisy data. To overcome this challenge, a gene selection method is used to select a subset of genes that increases the classifier’s ability to classify samples more accurately. Recently, several methods based on particle swarm optimization (PSO) have been proposed to select informative genes from microarray data.2–4 PSO is a new evolutionary computation technique proposed by Kennedy and Eberhart.5 It was motivated by simulations of the social behavior of organisms such as bird flocking and fish schooling. Shen et al.2 have proposed a hybrid of PSO and tabu search approaches for gene selection. However, the results obtained by using the proposed hybrid method are less significant because the application of tabu approaches in PSO is unable to search into all possible search spaces. Next, an improved binary PSO was proposed by Chuang et al.3 This approach produced 100% classification accuracy in many data sets, but it used a high number of selected genes to achieve this good result. This was because all the global best particles were reset to the same position when their fitness values did not change after three consecutive iterations. Li et al.4 introduced a hybrid of PSO and GA for the same purpose. Unfortunately, the accuracy of the result was still not high and many genes were selected for cancer classification since there is no probability relations between GA and PSO in the proposed hybrid method. Generally, the proposed methods that are based on PSO2–4 cannot efficiently produce a near-optimal (small) subset of informative genes for high classification accuracy. This is mainly because the total number of genes in the microarray data is too large (highdimensional data). The diagnostic goal is to develop a medical procedure based on the least possible number of genes that are needed to detect diseases. Thus, we propose an improved binary PSO to select a small (near-optimal) subset of informative genes that is most relevant for cancer classification. The proposed method is evaluated on two real microarray data sets.
17
if vid(t + 1) = 0 then
2 Methods
P ( xid( t + 1) = 1) = 0.5 or Sig ( 0 ) = 0.5
2.1 The standard version of binary PSO (BPSO)
if vid(t + 1) < 0 then
BPSO is initialized with a population of particles. At each iteration, all particles move in a problem space to find the optimal solution. A particle represents a potential solution (gene subset) in an n-dimensional space.6 Each particle has position and velocity vectors for directing its movement. The position vector and velocity vector of the i-th particle in the n-th dimension are represented as Xi = (x1i, x2i, . . . , xin) and Vi = (v1i, v2i, . . . , vin), respectively, where xid is a binary bit, i = 1, 2, . . . m (m is the total number of particles), and d = 1, 2, . . . n (n is the dimension of the data). In gene selection, the vector of particle positions is represented by a binary bit string of length n, where n is the total number of genes. Each vector Xi denotes a gene subset. If the value of the bit is 1, it means that the corresponding gene is selected. Otherwise, a value of 0 means that the corresponding gene is not selected. Each particle in the t-th iteration updates its own position and velocity according to the following equations: v ( t + 1) = w ∗ v ( t ) + c1r1 ∗ ( pbest ( t ) − x ( t )) + c2r2 ∗ ( gbest d( t ) − xid( t )) d i
d i
Sig (v ( t + 1)) = d i
if
d i
d i
1
1 + e − vi (t + 1) d Sig (vi ( t + 1)) > r3, then xid( t + 1) = 1; d
else xid( t + 1) = 0
(1) (2) (3)
where w is the inertia weight, c1 and c2 are the acceleration constants in the interval [0,2], r1, r2, and r3 are random values in the range [0,1], and Pbesti = (pbest1i, pbest2i, . . . , pbestin) and Gbest = (gbest1, gbest2, . . . , gbestn) represent the best previous position of the i-th particle and the global best position of the swarm (all particles), respectively. Sig(vid(t + 1)) is a sigmoid function where Sig(vid(t + 1)) ∈ [0,1]. 2.2 An improved binary PSO (IPSO) In this article, we propose IPSO for gene selection. This is introduced to solve the problems resulting from the microarray data, overcome the limitations of previous work2–4, and align with the diagnostic goal. IPSO in our work differs from the methods used in previous work2–4 in one major area. This difference is that we modify the existing rule (Eq. 3) for updating particle positions, whereas the previous work used a standard rule (Eq. 3). Firstly, we analyze the sigmoid function (Eq. 2). This function represents the probability of xid(t + 1) being 0 or 1 (P(xid(t + 1) = 0) or (P(xid(t + 1) = 0))). It has the following properties. lim Sig (vid( t + 1)) = 1
(4)
lim Sig (vid( t + 1)) = 0
(5)
vid ( t + 1)→∞
vid ( t + 1)→−∞
(6)
P ( xid( t + 1) = 1) < 0.5 or Sig (vid( t + 1) < 0 ) < 0.5
(7)
if v (t + 1) > 0 then d i
P ( xid( t + 1) = 1) > 0.5 or Sig (vid( t + 1) > 0 ) > 0.5
(8)
P ( x ( t + 1) = 0 ) = 1 − P ( x ( t + 1) = 1)
(9)
d i
d i
Also note that the value of x (t + 1) can change even if the value of vid(t + 1) does not change, due to the random number r3 in Eq. 3. To implement IPSO, the following approaches are suggested. d i
2.2.1 Modifying the existing rule for updating particle positons In order to support the diagnostic goal that needs the least number of genes for accurate cancer classification, the rule for updating particle positions is simply modified as follows: if
S (Vi( t + 1)) > r3, then xid( t + 1) = 0;
else xid( t + 1) = 1
(10)
The value of particle velocity Vi in the modified formula (Eq. 10) represents all the elements of a particle velocity vector, whereas the standard formula (Eq. 3) represents a single element. Moreover, Vi is also a positive real number. Based on this positive velocity value, Eqs. 2 and 10, the possibility of xid = 1 is too small. This situation causes a small number of genes to be selected in order to produce a nearoptimal gene subset from high-dimensional data (microarray data). 2.2.2 A simple modification of the formula for updating particle velocities In this formula, the calculation of the value of the velocity is completely based on all the bits of a particle position vector, whereas the original formula (Eq. 1) is based on a single bit. Vi ( t + 1) = w ∗ Vi ( t ) + c1r1 ∗ ( Pbesti ( t ) − X i ( t )) + c2r2 ∗ (Gbest ( t ) − X i ( t ))
(11)
2.2.3 Calculation for the distance of two positions The number of different bits between two particles relates to the difference between their positions, for example, Gbest(t) = [1011101001] and Xi(t) = [0100110101]. The difference between Gbest(t) and Xi(t) is [1–1110–11–100]. A value of 1 indicates that compared with the best position, this bit (gene) should be selected, but it is not selected, which may decrease the classification quality and lead to a lower fitness value. In contrast, a value of −1 indicates that, compared with the best position, this bit should not be selected, but it is selected. The selection of irrelevant genes
18
makes the length of the subset longer and leads to a lower fitness value. Assume that the number of 1’s is a, whereas the number of −1’s is b. We use the absolute value of (a − b) to express the distance between the two positions. In this example, (a − b) = 4 − 3, so the distance between Gbest(t) and xi(t) is Gbest(t) − Xi(t) = 1. 2.2.4 Fitness function The fitness value of a particle (a gene subset) is calculated as follows: fitness ( X i ) = w1 × A ( X i ) + (w2 ( M − R ( X i )) M )
(12)
in which A(Xi) ∈ [0,1] is leave-one-out-cross-validation (LOOCV) accuracy that uses only the genes in a gene subset (Xi). This accuracy is provided by an SVM classifier. R(Xi) is the number of selected genes in Xi. M is the total number of genes for each sample. w1 and w2 are two priority weights corresponding to the importance of accuracy and the number of selected genes, respectively, where w1 ∈ [0.1,0.9] and w2 = 1 − w1.
3 Experiments 3.1 Data sets and experimental setup Two benchmark microarray data sets are used to evaluate IPSO: the leukemia cancer and colon cancer data sets. The leukemia data set contains the expression levels of 7129
genes and can be downloaded at http://www.broad.mit.edu/ cgi-bin/cancer/datasets.cgi. It has 72 samples. For the colon cancer data set, there are 62 samples. This can be obtained at http://microarray.princeton.edu/oncology/affydata/index. html. Firstly, we applied the gain ratio technique to pre-select 500 top-ranked genes. These top-ranked genes are then used by IPSO and BPSO in the next process. In this article, LOOCV is used to measure the classification accuracy of a gene subset. The implementation of LOOCV is carried out in exactly the same way as in Chuang et al.3 The two most important criteria are considered to evaluate the performances of IPSO and BPSO: LOOCV accuracy and the number of selected genes. A near-optimal subset that produces the highest classification accuracy with the smallest number of genes is selected as the best subset. Several experiments are independently conducted 10 times on each data set using IPSO and BPSO. Next, an average result of the 10 independent runs is obtained. 3.2 Experimental results Based on the standard deviations of classification accuracy and the number of selected genes (shown in Table 1), the results produced by IPSO were nearly consistent in both data sets. Interestingly, all runs achieved 100% LOOCV accuracy in the leukemia data set with fewer than 5 selected genes. It is worth mentioning that according to Table 2, the overall classification accuracy and the number of selected
Table 1. Classification accuracies for each run using IPSO Run No.
Leukemia data set
Colon data set
Classification accuracy (%)
No. selected genes
Classification accuracy (%)
No. selected genes
1 2 3 4 5 6 7 8 9 10
100 100 100 100 100 100 100 100 100 100
4 2 4 4 3 4 4 3 4 3
93.55 93.55 96.77 93.55 93.55 95.16 93.55 95.16 93.55 93.55
5 5 4 5 4 5 4 4 5 4
Average ± SD
100 ± 0
3.50 ± 0.71
94.19 ± 1.13
4.5 ± 0.53
Note: The results of the best subsets are shown in shaded cells SD, standard deviation
Table 2. A comparison in terms of the statistical results of the proposed IPSO and BPSO Data
Method
IPSO
BPSO
Evaluation
The best
Average
SD
The best
Average
Leukemia
Classification accuracy (%) No. selected genes
100 2
100 3.50
0 0.71
98.61 216
98.61 224.70
0 5.23
Colon
Classification accuracy (%) No. selected genes
94.19 4.50
1.13 0.53
87.10 214
86.94 231
0.51 10.19
96.77 4
Note: The best result of each data set is shown in shaded cells SD, standard deviation
SD
19 Table 3. A comparison between our proposed method (IPSO) and other previous methods based on PSO IPSO (This work)
PSOTS (Shen et al.2)
IBPSO (Chuang et al.3)
PSOGA (Li et al.4)
Evaluation Leukemia
Classification accuracy (%) No. selected genes
(100) (3.5)
(98.61) (7)
100 1034
(95.1) (21)
Colon
Classification accuracy (%) No. selected genes
(94.19) (4.50)
(93.55) (8)
– –
(88.7) (16.8)
Data
Method
Note: The results of the best subsets are shown in shaded cells –, result not reported in the previous related work; ( ), average result; PSOTS, a hybrid of a PSO and tabu search; IBPSO, an improved binary PSO; PSOGA, a hybrid of PSO and GA
genes using IPSO are superior to those obtained with the standard version of binary PSO in terms of the best, average, and standard deviation results. For an objective comparison, we only compare our work with previous related work that used PSO in the methods.2–4 This is shown in Table 3. For the leukemia data set, the averages of LOOCV accuracy and the number of selected genes in our work were 100% and 3.5 genes, respectively. The latest previous work also came up with a similar LOOCV result to ours, but they used more than 1000 genes to obtain the same result.3 Overall, this work has outperformed previous related work on both the data sets in terms of LOOCV accuracy and the number of selected genes. According to Tables 1–3, IPSO is reliable for gene selection since it has produced the near-optimal solution from microarray data. This is due to a modification of the rule for updating particle positions that causes the selection of a small number of genes. Therefore, IPSO yields the optimal gene subset (a small subset of informative genes with high classification accuracy) for cancer classification.
4 Conclusion In this article, IPSO has been proposed and tested for gene selection on two real sets of microarray data. Based on the
experimental results, the performance of IPSO was superior to BPSO and related previous work. This is due to the fact that the modified rule for updating particle positions in IPSO causes a small number of genes to be selected in each iteration, and finally produces a near-optimal subset of informative genes for cancer classification. For future work, a combination between a constraint approach and PSO will be proposed to increase the classification accuracy.
References 1. Knudsen S (2002) A biologist’s guide to analysis of DNA microarray data. Wiley 2. Shen Q, Shi WM, Kong W (2008) Hybrid particle swarm optimization and tabu search approach for selecting genes for tumor classification using gene expression data. Comput Biol Chem 32: 53–60 3. Chuang LY, Chang HW, Tu CJ, et al (2008) Improved binary PSO for feature selection using gene expression data. Comput Biol Chem 32:29–38 4. Li S, Wu X, Tan M (2008) Gene selection using hybrid particle swarm optimization and genetic algorithm. Soft Comput 12:1039– 1048 5. Kennedy J, Eberhart R (1995) Particle swarm optimization. Proceedings of the IEEE International Conference on Neural Networks, vol 4, pp 1942–1948 6. Kennedy J, Eberhart R (1997) A discrete binary version of the particle swarm algorithm. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol 5, pp 4104– 4108