
Cancer Biomarker Assessment Using Evolutionary Rough Multi-Objective Optimization Algorithm

Anasua Sarkar* Senior Member, IEEE, Assistant Professor, Government College of Engineering and Leather Technology, Kolkata-98, West Bengal, India, Email: [email protected], [email protected]

Ujjwal Maulik Senior Member, IEEE, Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India, Email: [email protected]

ABSTRACT

A hybrid unsupervised learning algorithm, termed the Evolutionary Rough Multi-Objective Optimization (ERMOO) algorithm, is proposed in this article. It integrates the principles of rough set theory with the archived multi-objective simulated annealing approach. In this implementation, the boundary approximations of rough sets deal with the incompleteness in the dynamic classification method, with the quality of classification coefficient serving as the measure of classificatory competence; this enables faster convergence of the Pareto-archived evolution strategy. The algorithm incorporates a rough set-based dynamic archive classification method. A measure of the amount of domination between two solutions is used to determine the acceptance probability of a new solution, and adopting rough set theory improves the spread of the nondominated solutions on the Pareto front. The performance is demonstrated on a real-life breast cancer dataset for identification of cancer associated fibroblasts (CAFs) within the tumor stroma, and the identified biomarkers are reported. Moreover, biological significance tests have been carried out for the obtained markers.

INTRODUCTION

The progress of microarray technology in the field of cancer research has enabled scientists to measure the molecular signatures of cancer cells. Scientists today monitor the expression levels of differentially expressed cancer genes simultaneously over different time points and under different drug treatments (Tusher, V.G., 1940). In microarray analysis, the expression levels of two genes may rise and fall synchronously in response to environmental stimuli (Tusher, V.G., 1940), (Eisen, M., 1998). Efficient machine learning classifiers help in the diagnosis of cancer subtypes for patients (Spang, R., 2003). In recent times, researchers have been developing computational methods for the analysis of RNA and gene expression profiles for oncological detection. Such computational methods are expected to support the experimental work that needs to be carried out in the wet laboratory for analyzing biomarker RNAs. Gene expression profiling stratifies breast tumors into different molecular subtypes, which also co-segregate with the receptor status of the tumor cells. Therefore cancer associated fibroblasts (CAFs) within the tumor stroma may exhibit subtype specific gene expression profiles. These onco-RNA signatures may be further analyzed computationally to find the most significant oncological biomarkers.

Clustering is an unsupervised classification method based on maximum intra-class similarity and minimum inter-class similarity. Historically, Eisen et al. (Eisen, M., 1998) first classified groups of coexpressed genes using hierarchical clustering. Other previously proposed clustering methods that can be applied to cancer subtype detection are the self-organizing map (SOM) (Spang, 2003), K-Means clustering (Tavazoie, S., 2001), (Hoon, M. J. L. de, 2004), simulated annealing (Lukashin, A., 1999), the graph theoretic approach (Xu, Y., 1999), fuzzy c-means clustering (Dembele, D., 2003), spectral clustering (Maulik, U., 2013), (Sarkar, A., 2011), scattered object clustering (de Souto, MCP., 2008) and symmetry based clustering (Maulik, U., 2012), (Sarkar, A., 2009). Several other methods (Maulik, U., 2009), (Sarkar, A., 2009), (Bandyopadhyay, S. 2010) may also be applied efficiently to the cancer subtype detection problem. Earlier approaches detected cancer biomarkers using a hybrid of a Genetic Algorithm and all paired (AP) support vector machine (SVM) approaches (Liu, J. J., 2005). Multi-category classification SVM (MC-SVM) approaches have also been explored in the Gene Expression Model Selector (GEMS) system (Statnikov, A., 2005). Feature ranking scores for feature selection have also been used in the SVM-RFE approach (Duan, K. B., 2005). Therefore, we can implement a method which would simultaneously optimize the features and select biomarkers considering multiple objectives.
Multi-objective optimization (MOO) involves the simultaneous optimization of two or more conflicting objectives, forming the Pareto-optimal (PO) or non-dominated set of solutions with equal importance. This perspective differs from the single-objective optimization problem with only one global optimum, which lacks significance for most real-world problems with multiple objectives. Over the past decade, a number of multi-objective techniques based on Evolutionary Algorithms (EA) have been suggested (Deb, K., 2001), (Coello, C., 2002), (Srinivas, N., 1995), (Zitzler, E., 1998), (Maulik, U., 2010), (Ganesan, 2013). The unique features behind the success of EAs in solving MOO problems are their population-based nature and their ability to find multiple Pareto-optima simultaneously. Some of them consider soft computing approaches (Elamvazuthi, 2013), (Jimenez, 2013). Simulated Annealing (SA) (Kirkpatrick, S., 1983) is another method, which uses principles of statistical mechanics to find minimal cost solutions for large optimization problems by minimizing the associated energy. Since each execution of the SA method converges to one global optimum solution, the Multi-Objective Simulated Annealing (MOSA) approach evolves the PO set of solutions over multiple SA runs. However, the diversity of the PO set of solutions suffers because those runs are independent. Therefore, a Pareto-dominance based acceptance criterion has been incorporated in MOSA (Suman, B., 2005), (Smith, K., 2004). To consider the amount of domination between a new solution and the PO solutions stored in the Archive, including the current solution, a novel Archived Multi-Objective Simulated Annealing (AMOSA) algorithm was proposed by Bandyopadhyay et al. (Bandyopadhyay, S. 2008). This improved method utilizes a domination based energy function with different forms of acceptance probabilities depending on the domination status, explained later on.
The AMOSA approach utilizes the average linkage clustering method for limiting the number of non-dominated solutions in the Archive to a pre-defined fixed limit, which lacks automatic selection of diversified core solutions to represent the exact spread of the Pareto front. Rough set theory, proposed by Pawlak (Pawlak, Z., 1991), is a paradigm for dealing with uncertainty, vagueness and incompleteness by computing lower and upper approximations of sets. Therefore, rough set theory has recently been used for clustering (Dembele, D., 2003), (Qin, J., 2003), (Sarkar, A., 2013). Hirano and Tsumoto proposed an indiscernibility based clustering method that can handle relative proximity. Lingras (Xu, Y., 1999), (Dembele, D., 2003), (Qin, J., 2003) used rough set theory to develop an interval representation of clusters. This model is useful when the clusters do not necessarily have crisp boundaries. In several different approaches, rough set theory has been combined with multi-objective differential evolution (MODE) theory (Alfredo, G., 2006), particle swarm optimization (PSO) theory (Lin, W., 2008) or independent component analysis (ICA) theory for classificatory decomposition of signals (Smolinski, T. G., 2006). In this article, this theory is utilized to find solutions within the most promising neighborhood of a reference set. The potential of rough set theory as a local optimizer using classificatory analysis can extend the approximation of the diversified Pareto front produced by the SA-based MOEA approach. The main research goal here is to utilize the accuracy of classification coefficient of rough sets as the dynamic classificatory measure (Smolinski, T. G., 2006); it preserves the criteria of offspring generation within the most promising approximately bounded decision variable space. Therefore we implement this new rough set-based extension, named the ERMOO algorithm. For performance analysis of this ERMOO implementation over a breast cancer gene expression profile, three performance metrics, namely convergence, spacing and displacement, are reported. The performance of the proposed method has been demonstrated on publicly available breast cancer RNA expression datasets for cancer associated fibroblasts (CAFs) within the tumor stroma. The experimental results exhibit the efficiency of the proposed technique.
First, experiments have been conducted to discover RNA biomarkers that distinguish CAFs from normal breast fibroblasts, with variations among breast cancer subtypes. Subsequently, biological significance tests have been conducted for the selected biomarkers.

BACKGROUND

A multi-objective optimization algorithm is an approach for finding solutions that give values of all the objective functions acceptable to the decision maker (Coello, C., 2002). In MOO algorithms, the goal is to optimize two or more objective functions simultaneously. Thus, in this context, one looks for good compromises (or trade-offs) rather than a single solution as in global optimization of a single objective function. This notion is called the Edgeworth-Pareto optimum or Pareto optimum. A solution is Pareto optimal if there exists no other feasible solution that would reduce some objective function without increasing at least one other objective function value (assuming minimization). The vectors for these Pareto optimal solutions are called nondominated. When plotted in the objective function space, these nondominated vectors collectively form the Pareto front.

There exists a range of methods that have proved efficient for cancer subtype detection in various kinds of cancer, such as leukemia, ovarian, colorectal or breast cancer. Recent works in this field tend to be more specialized both in methods and outputs. Salazar et al. (Salazar, R., 2012) developed a gene expression classifier to predict early-stage colorectal cancer (CRC). They developed a prognostic classifier with an initial signature, which succeeds in detecting the risk of cancer recurrence without prescreening for microsatellite instability (Salazar, 2012). There are five molecular subtypes of breast cancer detected in an intrinsic gene list (Mackay, A., 2011) utilizing the inter-observer agreement among five breast cancer researchers. Mackay et al. (Mackay, A., 2011) performed hierarchical clustering using the Kappa statistic for perfect agreement to solve this problem.

Further work on colorectal cancer has recently led to the utilization of oncogenic miRNAs which are highly expressed in cancer stroma (Nishida, N., 2012). By predicting the putative targets, Nishida et al. (Nishida, N., 2012) reveal that the downregulation of specific miRNA clusters in cancer stroma is associated with clinicopathological factors. Arraymining.net (Glaab, E., 2009) is a web-based application with ensemble learning and consensus clustering for microarray analysis over benchmark datasets. Gong et al. (Gonge, 2011) showed that the gene regulatory modules related to endocrine therapy of breast cancer can be predicted with the motif-guided sparse decomposition method in clustering. The Phenotypic Upregulated Gene Support Vector Machine (Yu, G., 2011), which handles multiclass gene marker features and high gene space dimensionality, has proved effective in multiclass gene selection for prostate and uterine cancers. For genomic data integration with copy number profiling measurement, Lathi et al. (Lathi, 2012) provide a transparent benchmarking procedure in cancer gene prioritization. A heuristic breadth-first search algorithm has recently been applied to robust tumor classification (Wang, 2012); it uses gene ranking to find the most important gene subset. Another approach uses sparse optimal scoring (SOS) for multi-class cancer characterization (Leng, C., 2008), based on Fisher's Linear Discriminant Analysis (LDA). (Zhoo, 2009) uses a memetic approach for simultaneous identification of full class relevant (FCR) and partial class relevant (PCR) features in multiclass problems. (Li, S., 2014) surveyed evolutionary algorithm based approaches for bioinformatics in detail. All these recent research works in gene priority detection for cancer subtype prediction have thus proved efficient in their search for oncological significance.
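The Pareto-dominance notion defined at the start of this section can be illustrated with a minimal sketch. It assumes minimization of every objective; the function names `dominates` and `nondominated` and the sample points are hypothetical and serve only as an example.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical two-objective values, for illustration only.
candidates = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]
front = nondominated(candidates)
print(front)
```

Here (3.0, 4.0) is dominated by (2.0, 3.0) and (5.0, 5.0) is dominated by (1.0, 5.0), so only the three remaining trade-off points survive as the nondominated set.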

ARCHIVED MULTI-OBJECTIVE SIMULATED ANNEALING: AMOSA

The Archived Multi-objective Simulated Annealing (AMOSA) algorithm (Bandyopadhyay, S. 2008) incorporates the concept of an archive to provide a trade-off solution, together with the novel concept of the amount of dominance, which determines the acceptance of a new solution. The PO solutions are stored in an archive of limited size, since in the end only a limited number of well distributed PO solutions are needed. Two limits are kept for the archive size: a hard or strict limit HL and a soft limit SL. During the process, when the number of nondominated solutions in the archive exceeds SL, the archive size is reduced to HL by applying the average linkage clustering method; this clustering explicitly enforces the diversity of the non-dominated solutions. The amount of dominance is computed to determine the acceptance probability of a new solution with respect to the current solution and the other solutions in the Archive. This parameter is measured in terms of the hypervolume between two solutions in the objective space. It yields a nonzero acceptance probability, which makes the search less greedy in nature. In the initialization phase, each solution in the archive is refined for a number of iterations using a hill-climbing technique that accepts a new solution only if it dominates the previous one. Given two solutions a and b, the amount of domination is defined in Equation 1 below as

$$\Delta dom_{a,b} = \prod_{i=1,\ f_i(a) \neq f_i(b)}^{M} \frac{|f_i(a) - f_i(b)|}{R_i} \qquad (1)$$

where M = number of objectives and R_i = range of the ith objective.
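As a hedged illustration of the amount of domination, the sketch below multiplies, over the objectives in which two solutions differ, the absolute difference of the objective values divided by the range of that objective. The function name `amount_of_domination` and the numeric values are assumptions made for this example and are not taken from the AMOSA implementation.

```python
from typing import Sequence


def amount_of_domination(fa: Sequence[float],
                         fb: Sequence[float],
                         ranges: Sequence[float]) -> float:
    """Product of |f_i(a) - f_i(b)| / R_i over objectives where a and b differ.

    fa, fb : objective vectors of solutions a and b
    ranges : R_i, the range of each objective (must be nonzero)
    Note: if a and b agree in every objective the empty product 1.0 is returned.
    """
    product = 1.0
    for x, y, r in zip(fa, fb, ranges):
        if x != y:  # objectives with equal values are skipped
            product *= abs(x - y) / r
    return product


# Hypothetical objective vectors and ranges, for illustration only.
dom_all = amount_of_domination((1.0, 2.0), (3.0, 6.0), (10.0, 8.0))
dom_one = amount_of_domination((1.0, 4.0), (3.0, 4.0), (10.0, 8.0))
print(dom_all, dom_one)
```

In the first call both objectives differ, so the result is (2/10) × (4/8); in the second call the second objective is identical and contributes nothing to the product.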

In the main AMOSA process, one randomly selected point from the Archive, called current-pt, is perturbed to generate a new solution new-pt, and the domination status of new-pt is computed with respect to current-pt and the solutions in the Archive. Based on the domination status between current-pt and new-pt, three cases may arise, as mentioned in (Bandyopadhyay, S. 2008) and shown in Figure 1. As in SA, this process is repeated iter times for each temperature temp, which starts at Tmax and is reduced to α × temp after each round until it reaches Tmin. Then the process stops, producing the final non-dominated solutions in the Archive. The time complexity of the AMOSA algorithm for N objectives is O(iter ∗ N ∗ (N + log(S))) (Bandyopadhyay, S. 2008), where SL = β ∗ HL, β = 2 and HL = S. Here S denotes the population size.
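The outer loop structure described above (iter perturbation steps at each temperature, with geometric cooling from Tmax toward Tmin) can be sketched as follows. This is only a skeleton under stated assumptions: the generator name `annealing_schedule` and the parameter values are hypothetical, the loop is assumed to continue while the temperature stays above Tmin, and the perturbation and acceptance logic of AMOSA is deliberately left out.

```python
from typing import Iterator, Tuple


def annealing_schedule(t_max: float, t_min: float,
                       alpha: float, iters: int) -> Iterator[Tuple[float, int]]:
    """Yield (temperature, iteration index) pairs for geometric cooling.

    The temperature starts at t_max and is multiplied by alpha (0 < alpha < 1)
    after every block of `iters` iterations, while it remains above t_min.
    """
    temp = t_max
    while temp > t_min:
        for i in range(iters):
            # A full AMOSA step would perturb current-pt here and decide
            # whether to accept new-pt; this sketch only enumerates the steps.
            yield temp, i
        temp *= alpha


# Hypothetical parameter values, for illustration only.
steps = list(annealing_schedule(t_max=100.0, t_min=10.0, alpha=0.5, iters=2))
print(len(steps), steps[0], steps[-1])
```

With these values the temperature takes the levels 100, 50, 25 and 12.5 before dropping below Tmin, so the loop body runs 4 × 2 times.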

Figure 1 Steps of main AMOSA process.

ROUGH SETS THEORY

The theory of rough sets begins with the notion of an indiscernibility relation, which induces a pair ⟨U,R⟩, where U is a nonempty set (the universe of discourse) and R is an equivalence relation on U. R decomposes the set U into disjoint equivalence classes, denoted as [x]B (for some object x described by a set of attributes B). Let U/R = {X1, X2, ···, Xm}, where Xi is an equivalence class of R, i = 1, 2, ···, m. The steps for computing rough sets based on R are shown in Figure 2.

Figure 2 Steps for rough sets theory based on R

The main goal of rough set analysis is to synthesize approximations of concepts from acquired data. Consequently, an arbitrary set X ⊆ U can be approximated using the information contained in B by defining a pair of B-lower and B-upper approximations of X (Pawlak, Z., 1982), (Pawlak, Z., 1991), denoted by $\underline{B}X$ and $\overline{B}X$ respectively.

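$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} = \bigcup \{[x]_B \in U/R \mid [x]_B \subseteq X\} \qquad (2)$$

$$\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\} = \bigcup \{[x]_B \in U/R \mid [x]_B \cap X \neq \emptyset\} \qquad (3)$$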

The lower approximation $\underline{B}X$ is the union of all the elementary sets that are subsets of X, and the upper approximation $\overline{B}X$ is the union of all the elementary sets that have a non-empty intersection with X. The interval $[\underline{B}X, \overline{B}X]$ is called the rough set of X. In (Pawlak, Z., 1982), (Pawlak, Z., 1991), Pawlak discussed two numerical characterizations for the imprecision of X: accuracy and roughness. The accuracy of approximation for X, denoted by $\alpha_B(X)$, is the ratio of the number of objects in its lower approximation to that in its upper approximation:

$$\alpha_B(X) = \frac{|\underline{B}X|}{|\overline{B}X|} \qquad (4)$$

The roughness of X, denoted by $\rho_B(X)$, is defined as $\rho_B(X) = 1 - \alpha_B(X)$. Note that the lower the roughness of a subset, the better its approximation. Each decision d determines a classification of objects in X (Pawlak, Z., 1982), (Pawlak, Z., 1991) as $CLASS_R(d) = \{X_1^R, \cdots, X_{r(d)}^R\}$ of the universe U, where $X_k^R = \{x \in U \mid d(x) = v_k^d\}$ for $1 \leq k \leq r(d)$, $v_k^d$ denotes the kth value for decision d and r(d) denotes the rank of decision d. A classification can be characterized numerically by the quality of classification, used as the fitness function measure and defined below:

$$\gamma_B(X) = \frac{|\underline{B}X \cup \underline{B}\neg X|}{|U|} \qquad (5)$$

where $\underline{B}\neg X$ is the lower approximation of the set of objects that do not belong to X.
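The rough set quantities of Equations 2 to 5 can be computed directly from the equivalence classes. The sketch below is a minimal illustration on a toy universe; the helper names and the six-object example are hypothetical and are not part of the ERMOO implementation.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set


def equivalence_classes(attributes: Dict[int, Hashable]) -> List[Set[int]]:
    """Group objects that share the same attribute description (the classes [x]_B)."""
    classes = defaultdict(set)
    for obj, description in attributes.items():
        classes[description].add(obj)
    return list(classes.values())


def lower_approx(classes: List[Set[int]], X: Set[int]) -> Set[int]:
    """Union of the classes fully contained in X (Equation 2)."""
    return set().union(*[c for c in classes if c <= X])


def upper_approx(classes: List[Set[int]], X: Set[int]) -> Set[int]:
    """Union of the classes with a non-empty intersection with X (Equation 3)."""
    return set().union(*[c for c in classes if c & X])


def accuracy(classes: List[Set[int]], X: Set[int]) -> float:
    """|lower| / |upper| (Equation 4); defined as 1.0 when the upper set is empty."""
    upper = upper_approx(classes, X)
    return len(lower_approx(classes, X)) / len(upper) if upper else 1.0


def quality(classes: List[Set[int]], X: Set[int], universe: Set[int]) -> float:
    """|lower(X) union lower(not X)| / |U| (Equation 5)."""
    complement = universe - X
    return len(lower_approx(classes, X) | lower_approx(classes, complement)) / len(universe)


# Toy universe: objects 1..6 described by one symbolic attribute value.
attrs = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c", 6: "c"}
U = set(attrs)
classes = equivalence_classes(attrs)
X = {1, 2, 3}
low, up = lower_approx(classes, X), upper_approx(classes, X)
acc = accuracy(classes, X)
rough = 1.0 - acc
gamma = quality(classes, X, U)
print(low, up, acc, rough, gamma)
```

In this example the class {3, 4} straddles the boundary of X, so it belongs to the upper but not the lower approximation, giving an accuracy of 2/4 and a quality of classification of 4/6.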

PROPOSED EVOLUTIONARY ROUGH MULTI-OBJECTIVE OPTIMIZATION APPROACH

In this article, a new rough set based Evolutionary Rough Multi-Objective Optimization (ERMOO) approach is proposed, building on the archived MOO algorithm AMOSA (Bandyopadhyay, S. 2008). The original AMOSA algorithm utilizes the average linkage clustering method for reducing the number of solutions in the Archive from the soft limit SL to the hard limit HL. It lacks automatic determination of the number of non-dominated solutions to be stored in the Archive. It also lacks automatic elimination of the redundant solutions that lie outside the core set of the unsupervised Archive. Therefore a rough set based automatic core detection method for an unsupervised Archive can be utilized in the new ERMOO method to obtain a properly diversified PO front. Based on these observations, the rough set based archived multi-objective simulated annealing approach ERMOO, shown in Figure 3, is proposed in this section.

Rough Set Based Dynamic Classification of Unsupervised Archive

To explicitly enforce the diversity of the non-dominated solutions, rough set based classification rules on the Archive have been incorporated in ERMOO. The size of the Archive is allowed to increase up to the previously defined soft limit SL (> HL). Then the solutions are grouped into an automatically determined rough limit RL (SL > RL > HL) of clusters using rough set based classification rules. Expanding the Archive size up to SL or RL enables the formation of more widely spread clusters with better diversity. To intensify the local search in the area where non-dominated solutions reside, the algorithm should refrain from seeking more solutions, from the rest of the database, in the area where the dominated solutions reside. For this purpose rough set based unsupervised classification rules have been used.

Range Initialization

Given a decision system β = (U, B ∪ {d}), B is called the set of conditional attributes and d ∉ B is called the decision attribute. To initialize, all patterns are assigned a single class in the decision attribute. The classes of the temporal or clinically conditioned features of the cancer gene expression profile data sets are transformed into the decision variables.

FIGURE 3 Steps of ERMOO Algorithm.


Compute Boundary Approximations

To build the feasible set boundaries, the following B-lower and B-upper approximations for each decision variable set Xi, denoted by $\underline{B}X_i$ and $\overline{B}X_i$ respectively, are computed: $\underline{B}X_i = \{x \mid [x]_B \subseteq X_i\}$ and $\overline{B}X_i = \{x \mid [x]_B \cap X_i \neq \emptyset\}$. Here [x]B denotes the equivalence class of a value x described by the set of attributes B. For each set Xi, we compute the accuracy measurement in Eq. 4.

Compute Quality of Classification and Classification Error

To solve the problem of maximizing the classificatory competence of the decomposition scheme, the rough sets-based quality of classification coefficient defined in Eq. 5 is used for estimating classificatory aptitude. Reducing the dimensionality of a given problem does not ensure the classificatory usefulness of the resultant set. Therefore, in the ERMOO approach proposed here, the sparseness term of the basis function coding is replaced by the rough sets-based, data reduction-driven classification accuracy measure defined in Eq. 5. The quality of classification coefficient is computed directly on the potential global optimal reduct and consequently on the dynamically obtained classes or clusters generated with maximum strength of rules. The classification error $1 - \gamma_R$ on the candidate reduct R is computed to ensure that the result will be both valid for new solution generation and useful for the classification task.

Generate Offspring Within Bounds

Inside the boundary approximations of each decision variable, the new offspring solutions are randomly chosen from the rest of the solutions. Then the solutions are checked to determine whether they belong to the non-dominated or dominated sets. The key point of the process is that the offspring, randomly generated within the bounds, are located in the most promising areas, with some diversity incorporated by this random feature.

PERFORMANCE ANALYSIS

Experimental Framework

All our experiments are performed on an Apple MacBook with a quad-core processor.

Performance Metrics

The fitness of a solution indicates the degree of goodness of the solution of the proposed algorithm (Young, 2001). In a multi-objective optimization (MOO) strategy, the resultant solution set should converge as close to the true PO front as possible, and it should also be as diverse as possible. The first condition ensures that the obtained solutions are near-optimal, and the second ensures that a wide range of trade-off solutions is obtained. No single existing performance measure captures both requirements adequately. Therefore three performance measures are used for this ERMOO approach, as described below:

i. Convergence - The Convergence measure Γ estimates the extent of convergence of the obtained solution set to a known set of PO solutions. For each solution obtained with an algorithm, the minimum Euclidean distance from chosen solutions on the Pareto-optimal front is computed. The average of these distances is used as the metric Γ. The lower the value of Γ, the better the convergence of the obtained solution set to the true PO front.

ii. Spacing - This measure was first proposed by Schott. It reflects the uniformity of the solutions over the non-dominated front. Smaller values of Spacing (S) indicate better performance. It is defined in Equation 7 as follows:

$$S = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} (d_i - \bar{d})^2} \qquad (7)$$

where $d_i = \min_{k \in Q,\ k \neq i} \sum_{m=1}^{M} |f_m^i - f_m^k|$ and $f_m^i$ (or $f_m^k$) is the mth objective value of the ith (or kth) solution in the final non-dominated solution set Q. $\bar{d}$ is the mean value of all $d_i$.

iii. Displacement - The Displacement measure is suitable even for a discontinuous PO front, as it measures how far the obtained solution set is from a known set of PO solutions. The lower the value of this measure, the better the convergence to, and the extent of coverage of, the true PO front. It is defined in Equation 8 as follows:

$$\text{Displacement} = \frac{1}{|P^*|} \sum_{i=1}^{|P^*|} \left( \min_{j=1}^{|Q|} d(i, j) \right) \qquad (8)$$

where P* consists of uniformly spaced solutions from the PO front in the objective space, Q is the obtained set of solutions and d(i, j) is the Euclidean distance between the ith solution of P* and the jth solution of Q.
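The three metrics can be sketched in a few lines of Python. This is a minimal illustration of the definitions as stated in this section, not the code used for the reported experiments; the function names and the small two-objective point sets are hypothetical, and the spacing distance is taken as the sum of absolute objective differences following Schott's definition.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def euclid(a: Point, b: Point) -> float:
    """Euclidean distance between two objective vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def convergence(Q: List[Point], P_star: List[Point]) -> float:
    """Mean, over obtained solutions, of the distance to the nearest reference point."""
    return sum(min(euclid(q, p) for p in P_star) for q in Q) / len(Q)


def spacing(Q: List[Point]) -> float:
    """Standard deviation of nearest-neighbour distances d_i (needs len(Q) >= 2)."""
    d = [min(sum(abs(x - y) for x, y in zip(Q[i], Q[k]))
             for k in range(len(Q)) if k != i)
         for i in range(len(Q))]
    mean = sum(d) / len(d)
    return math.sqrt(sum((di - mean) ** 2 for di in d) / len(d))


def displacement(Q: List[Point], P_star: List[Point]) -> float:
    """Mean, over reference points, of the distance to the nearest obtained solution."""
    return sum(min(euclid(p, q) for q in Q) for p in P_star) / len(P_star)


# Hypothetical reference front P* and obtained set Q, for illustration only.
P_star = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]
Q = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0)]
gamma_c = convergence(Q, P_star)
s = spacing(Q)
disp = displacement(Q, P_star)
print(gamma_c, s, disp)
```

In this toy case every obtained point lies exactly one unit above a reference point and the obtained points are evenly spread, so convergence and displacement are both 1 and spacing is 0. Note that convergence averages over Q while displacement averages over P*, which is why displacement also penalizes regions of the front that Q fails to cover.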

Data Set

We obtained a publicly available gene expression profile from the following website: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37614. The complete dataset consists of 16653 human cancer associated fibroblasts (CAFs) within the tumor stroma. From this, we compare the gene expression profiles of RNA isolated from early passage primary CAFs obtained from twenty human breast cancer samples representing three main subtypes: seven ER+, six triple negative (TNBC) and six Her2+ receptors.

RESULTS

The chosen cancer dataset has been analyzed using the K-Means algorithm, hierarchical clustering and PCA.

Figure 4 Heatmap of K-Means clustering results over breast cancer CAFs dataset, showing hierarchical clustering results in left side


Figure 5 Silhouette width of clusters produced by K-Means algorithm over breast cancer CAFs dataset

Figure 6 Principal component analysis results over breast cancer CAFs dataset showing percentage of variance in right side image

The clustering solutions of the K-Means algorithm are depicted in Figure 4 as a heatmap. The hierarchical clustering results over this dataset are shown on the left side of the heatmap. The Silhouette width computed over the K-Means clustering solutions is shown in Figure 5. We perform principal component analysis over the chosen dataset; the variance of the standard deviations of the PCA results is depicted on the right side of Figure 6. The first two principal components are plotted in Figure 7, where the solutions are colored according to the K-Means clustering results.

Figure 7 Plot showing standard deviation over principal components with variables colored according to K-Means clustering solutions

The proposed ERMOO method is executed on this cancer RNA expression profile multiple times, and for each run the output set of dominant solutions is collected. We found five RNAs to be dominant in the final archive using our new ERMOO algorithm, and these five RNAs are selected as the final set of RNA cancer biomarkers. Table 1 shows the ERMOO classifier performance on this dataset for the performance metrics defined above.

| Dataset | Convergence | Spacing | Displacement |
|---|---|---|---|
| RNA expression profiles for breast cancer associated fibroblasts (CAFs) | 594629.498 | 0.885520 | 8.074 |

Table 1 Performance metrics on clustering solutions provided by ERMOO algorithm on chosen dataset

A good MOO algorithm should converge as close to the true PO front as possible with a maximally diverse solution set. The Convergence measure shows the extent of convergence to a known PO solution set. Spacing is a measure which reflects the uniformity of solutions over the nondominated solution front; a smaller value indicates a better MOO algorithm. The distance between the obtained solution set and a known set of PO solutions is measured using the Displacement metric. The new ERMOO algorithm provides low spacing and displacement values, which indicates its effectiveness in generating compact final archives.

DISCUSSION

The ERMOO algorithm is a multi-objective simulated annealing approach with a rough set based decision rule generation strategy. Let Tcomp be the computational time complexity of a step of the algorithm. Then the time complexity of ERMOO, as illustrated in Figure 3, is analyzed below:

1. Archive initialization: Tcomp = O(SL).
2. Multi-objective simulated annealing approach: Tcomp = O((N ∗ RL + N) ∗ iter), where iter = number of iterations in the ERMOO approach.
3. Rough set based dynamic classification:
(a) Range initialization: Tcomp = O(N) for N objectives.
(b) Boundary approximations: Tcomp = O(N ∗ RL), since computing the lower and upper approximations takes O(RL) time for each objective.
(c) Quality of classification measure computation, automatic rough limit determination and offspring generation within bounds: Tcomp = Const.
4. Final Archive computation: Tcomp = Const.

Let θ = N ∗ RL. Since the rough set based dynamic classification method is executed after every (SL − RL) iterations and temp_interval = (Tmax − Tmin) / α, the total time complexity of the ERMOO approach, including the rough set based dynamic classification time, is:

TTotal = Σsteps Tcomp
= O(SL) + O(((N + RL ∗ N + Const) ∗ iter) ∗ temp_interval) + O(N ∗ RL ∗ (iter / (SL − RL)) ∗ temp_interval)
= O(SL) + O(((N + θ) ∗ iter) ∗ temp_interval) + O(θ ∗ (iter / (SL − RL)) ∗ temp_interval)
= O(θ ∗ iter ∗ temp_interval)

Biological Relevance

To study how the selected RNA biomarkers are involved in various biological activities, we have studied the KEGG pathway enrichment of the target genes of each of the selected RNAs using the Database for Annotation, Visualization and Integrated Discovery (DAVID) available at http://david.abcc.ncifcrf.gov/. Table 2 reports the results of this study.

| Sl. No. | Probe ID | Gene Name | Official Gene Symbol | Entrez Gene ID | KEGG pathways | Cancer type association |
|---|---|---|---|---|---|---|
| 1 | ILMN_1651259 | UDP-glucose pyrophosphorylase 2 | UGP2 | 7360 | hsa00040:Pentose and glucuronate interconversions, hsa00052:Galactose metabolism, hsa00500:Starch and sucrose metabolism, hsa00520:Amino sugar and nucleotide sugar metabolism | |
| 2 | ILMN_1651259 | hypothetical LOC647115 | FLJ36848 | 651125 | | |
| 3 | ILMN_1651296 | fms-related tyrosine kinase 4 | Flt4 | 2324 | hsa04060:Cytokine-cytokine receptor interaction, hsa04510:Focal adhesion | Hemangioma, Lymphedema (OMIM) |
| 4 | ILMN_1651296 | hypothetical protein LOC143666 | LOC143666 | 143666 | | |
| 5 | ILMN_1651354 | secreted phosphoprotein 1 | SPP1 | 6696 | hsa04510:Focal adhesion, hsa04512:ECM-receptor interaction, hsa04620:Toll-like receptor signaling pathway | Glaucoma, liver cancer (GADDC) |
| 6 | ILMN_1666057 | receptor accessory protein 2 | reep2 | 51308 | | |
| 7 | ILMN_1678812 | hyaluronan and proteoglycan link protein 1 | HAPLN1 | 1404 | | Breast and prostate cancer (OMIM) |
| 8 | ILMN_1800420 | ring finger protein 214 | RNF214 | 257160 | | |

Table 2 Biomarkers selected by ERMOO method.

The KEGG analysis results in Table 2 show eight probe IDs selected using our ERMOO approach. Their respective gene names, official gene symbols and Entrez gene IDs are also shown in Table 2. We have reported the top three significant pathways for the target genes as obtained from DAVID. Several metabolic pathways have been obtained among those eight selected probes. The UGP2 gene is involved in the hsa00040: Pentose and glucuronate interconversions pathway, while the Flt4 gene is involved in the hsa04060: Cytokine-cytokine receptor interaction pathway. The KEGG analysis also shows that both the Flt4 and SPP1 genes are involved in the hsa04510: Focal adhesion pathway. Notably, cancer type associations are found for three of the selected RNA biomarkers, i.e. ILMN_1651296, ILMN_1651354 and ILMN_1678812. This signifies that the selected RNA markers are indeed involved in different cancer pathways. Specific associations for all three markers, drawn from the OMIM disease and GENETIC_ASSOCIATION_DB_DISEASE_CLASS (GADDC) databases, are displayed in Table 2. These results indicate that the selected RNA biomarkers are involved in different types of cancer and can therefore be treated as potential RNA biomarkers. The analysis shows that the Flt4 gene is associated with Hemangioma and Lymphedema, while SPP1 is associated with Glaucoma and liver cancer. Similarly, HAPLN1 is associated with breast and prostate cancer. The available KEGG pathways are also shown in Table 2 for these three chosen RNAs. The GO analysis over these RNAs is shown in Table 3. The Component, Process and Function analysis results in terms of GO terms also show the significance of the chosen RNAs in their respective biological activities. The GO analysis for ILMN_1678812 provides the annotations GO:0007155~cell adhesion for Ontology Process and GO:0005539~glycosaminoglycan binding for Ontology Function. Therefore, this RNA may be involved in cell wall modification in breast and prostate cancer cells, as it has GO:0005576~extracellular region as its Ontology Component annotation. The GO process annotation GO:0001558~regulation of cell growth for ILMN_1651354 also emphasizes its involvement in Glaucoma and liver cancer. A similar analysis can be done on the other chosen biomarkers in Table 3.


Probe ID: ILMN_1651259; Entrez Gene ID: 7360
Ontology_Process: GO:0005975~carbohydrate metabolic process, GO:0005996~monosaccharide metabolic process, GO:0006006~glucose metabolic process

Probe ID: ILMN_1651296; Entrez Gene ID: 2324
Ontology_Process: GO:0006464~protein modification process, GO:0006468~protein amino acid phosphorylation, GO:0006793~phosphorus metabolic process

Probe ID: ILMN_1651354; Entrez Gene ID: 6696
Ontology_Component: GO:0005576~extracellular region, GO:0005615~extracellular space, GO:0031982~vesicle, GO:0031988~membrane-bounded vesicle, GO:0042995~cell projection, GO:0044421~extracellular region part, GO:0045177~apical part of cell, GO:0048471~perinuclear region of cytoplasm
Ontology_Process: GO:0000003~reproduction, GO:0001501~skeletal system development, GO:0001503~ossification, GO:0001558~regulation of cell growth
Ontology_Function: GO:0005125~cytokine activity, GO:0050840~extracellular matrix binding

Probe ID: ILMN_1666057; Entrez Gene ID: 51308
Ontology_Component: GO:0016021~integral to membrane, GO:0031224~intrinsic to membrane
Ontology_Process: GO:16020~Double layer of lipid molecules, GO:16021~Penetrating at least one phospholipid bilayer of a membrane
Ontology_Function: GO:16020~Double layer of lipid molecules, GO:16021~Penetrating at least one phospholipid bilayer of a membrane

Probe ID: ILMN_1678812; Entrez Gene ID: 1404
Ontology_Component: GO:0005576~extracellular region, GO:0005578~proteinaceous extracellular matrix, GO:0031012~extracellular matrix, GO:0044421~extracellular region part
Ontology_Process: GO:0007155~cell adhesion, GO:0009987~cellular process, GO:0022610~biological adhesion
Ontology_Function: GO:0001871~pattern binding, GO:0005539~glycosaminoglycan binding, GO:0005540~hyaluronic acid binding, GO:0030246~carbohydrate binding, GO:0030247~polysaccharide binding

Probe ID: ILMN_1800420; Entrez Gene ID: 257160
Ontology_Function: GO:0008270~zinc ion binding, GO:0043167~ion binding, GO:0043169~cation binding, GO:0046872~metal ion binding, GO:0046914~transition metal ion binding

Table 3 GO annotation analysis for finally selected six probes by ERMOO approach

FUTURE RESEARCH DIRECTIONS Cancer gene expression data analysis is one of the open challenges in computational oncology (Salazar, R. et al., 2011), (Mackay, A., 2011), (Nishida, N., 2012), (Gong, 2011), (Yu, G., 2011), (Lahti, 2012), (Souto, 2008). It provides a powerful tool by which the expression patterns of cancer samples across multiple conditions can be monitored simultaneously. As a scope of further research, the performance of rough set based decision rules can be compared with that of different state-of-the-art multi-objective classifiers (Jiménez, 2013), (Maulik, U., 2010), (Alfredo, G., 2006), (Coello, C., 2002), (Deb, K., 2001). The soft computing approach enhances the efficiency of experiments on real-life fuzzy cancer data. Therefore, by utilizing rough set based rule generation strategies, researchers can predict and analyze oncological datasets more efficiently. Moreover, the RNA biomarkers identified in the present work need to be investigated further biologically. The concepts of full class relevant (FCR) and partial class relevant (PCR) features in (Zhu, 2009) can be incorporated in our present approach using rough set based decision rules. The core of the dataset, as selected by the decision rules on rough sets, can be further utilized to obtain significant gene/RNA subsets relevant for onco-analysis. The inconsistency ratio (IR) for nodes in a metabolic network can be further enhanced using the concept of belongingness to a rough set (Li, S., 2014). The rough sets constructed on cancer datasets can also be made patient-specific, with rough set based decision rules created to detect and predict cancer types, suggest remedies, or analyze responses to particular drugs.

CONCLUSION Cancer gene expression data analysis is one of the latest breakthroughs in experimental molecular biology (Tyson, J.J., 2011), (Sarkar, A., 2009), (Ghorai, S., 2011), (Sarkar, A., 2009), (Souto, 2008), (Bandyopadhyay, S., 2007), (Maulik, U., 2010). The expression patterns of cancer samples are monitored simultaneously to detect cancer biomarkers. The contribution of this article lies in developing an evolutionary multiobjective optimization algorithm enhanced with rough set based classification rules on the Archive to enforce the diversity of the nondominated solutions. The proposed method has been applied to the identification of cancer RNA biomarkers from breast cancer RNA expression datasets of cancer associated fibroblasts (CAFs), and the results have been demonstrated. The technique optimizes RNA expression values in the rough set based approximated Archive and evolves the desired subset of features (RNAs). Moreover, three of the identified RNA biomarkers are found to be associated with different types of cancer according to recent disease-related databases. Finally, a pathway enrichment study has been conducted, which reveals that the genes targeted by these three RNA biomarkers are involved in several cancer pathways. GO annotations also reveal the biological significance of these biomarkers. As a scope of further research, the performance of rough set based decision rules is to be compared with that of different popular multiobjective classifiers (Bandyopadhyay, S., 2008). Moreover, the identified RNA biomarkers need to be investigated further biologically.
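The archive of nondominated solutions referred to in the conclusion can be illustrated with a short sketch. This is only a minimal Python illustration of Pareto dominance with a plain archive update. It assumes two hypothetical objectives that are both minimized (for example, a misclassification rate and the fraction of selected RNAs), and it omits the rough set based classification of the Archive and the simulated annealing acceptance step that ERMOO uses. The function names `dominates` and `update_archive` are illustrative and are not taken from the authors' implementation.

```python
from typing import List, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def update_archive(archive: List[Objectives], candidate: Objectives) -> List[Objectives]:
    """Offer a candidate to the archive and return the new archive.

    The candidate is rejected if an archive member dominates or equals it;
    otherwise it is added and every member it dominates is dropped.
    """
    if any(dominates(m, candidate) or m == candidate for m in archive):
        return list(archive)
    survivors = [m for m in archive if not dominates(candidate, m)]
    survivors.append(candidate)
    return survivors


# Hypothetical objective vectors: (misclassification rate, fraction of RNAs kept).
archive: List[Objectives] = []
for point in [(0.40, 0.30), (0.20, 0.50), (0.35, 0.25), (0.50, 0.50)]:
    archive = update_archive(archive, point)

print(archive)
```

In this toy run the point (0.35, 0.25) replaces (0.40, 0.30), which it dominates, and (0.50, 0.50) is rejected because (0.20, 0.50) dominates it.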

REFERENCES Liu, J. J., Cutler, G., Li, W., Pan, Z., Peng, S., Hoey, T., Chen, L., & Ling, X. B. (2005). Multiclass cancer classification and biomarker discovery using GA-based algorithms, Bioinformatics,21(11), 2691-7. Bandyopadhyay, S., Maulik, U., & Roy, D.(2008). Gene Identification: Classical and Computational Intelligence Approaches. IEEE Transactions on Systems, Man and Cybernetics, Part C, 38(1), 55-68. (2007). Analysis of Biological Data: A Soft Computing Approach, World Scientific, Singapore. Maulik, U., Bandyopadhyay, S., Wang, Jason T. L. (2010). Computational Intelligence and Pattern Analysis in Biology Informatics, Wiley Interscience, USA. Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics, 21(5), 631-43. Duan, K. B. , Rajapakse, J. C., Wang, H. , & Azuaje, F. , (2005). Multiple SVM-RFE for gene selection in cancer classification with expression data, IEEE Trans Nanobioscience. Vol 4( 3), 228-34. Leng, C., (2008). Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data, Computational Biology and Chemistry, 32, 417-425. Li, S., Kang, L., & Zhao, X.-M. , (2014). A Survey on Evolutionary Algorithm Based Hybrid Intelligence in Bioinformatics, BioMed Research International, 2014, Article ID 362738, 8 pages. Maulik, U. , Mukhopadhyay, A., & Bandyopadhyay, S., (2009). Combining Pareto-Optimal Clusters using Supervised Learning for Identifying Co-expressed Genes, BMC Bioinformatics, 10:27. Bandyopadhyay, S., Saha, S., Maulik, U., & Deb, K.. (2008). A Simulated Annealing Based Multiobjective Optimization Algorithm: AMOSA, IEEE Transaction on Evolutionary Computation, vol 12(3), 269-283. Bandyopadhyay, S., Mitra, R., Maulik, U. , & Zhang, M. Q. , (2010). Development of the Human Cancer microRNA Network, Silence, 1(6). Su, M C, & C H Chou & C C Hsieh. 
(2005) Fuzzy C-Means algorithm with a point symmetry distance. International Journal of Fuzzy Systems, 7(4), 175–181. Tusher, V.G., Tibshirani, R. & Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 98(9), 5116–5121. Spang, R. (2003) Diagnostic signatures from microarrays: a bioinformatics concept for personalized medicine, BIOSILICO, 1(2), 64–68. Eisen, M., Spellman, P., Brown, P., & Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 95, 14863–14868. Tavazoie, S., Hughes, J., Campbell, M., Cho, R., & Church, G. (2001) Systematic determination of genetic network architecture. Bioinformatics. 17, 405–414. Hoon, M. J. L. de, Imoto, S., Nolan, J. & Miyano, S. (2004) Open source clustering software. Bioinformatics. 20(9), 1453–1454. Lukashin, A., & Fuchs, R. (1999) Analysis of temporal gene expression profiles: clustering by simulated annealing and determining optimal number of clusters. Nat. Genet., 22, 281–285. Xu, Y., Olman, V., & Xu, D. (1999). Clustering gene expression data using a graph theoretic approach: an application of minimum spanning trees. Bioinformatics, 17, 309–318.

Dembele, D., & Kastner, P. (2003) Fuzzy c-means method for clustering microarray data. Bioinformatics, 19, 973–980. Qin, J., Lewis, D., & Noble, W. (2003) Kernel hierarchical gene clustering from microarray gene expression data. Bioinformatics, 19, 2097–2104. Giraud-Carrier, C., Vilalta, R., & Brazdil, P. (2004) Introduction to the special issue on meta-learning. Machine Learning, 54(3), 187–193. de Souto, MCP., R, RBCP., Soares, RGF., de Araujo, DSA., Costa, IG., Ludermir, TB. & Schliep, A. (2008) Ranking and Selecting Clustering Algorithms Using a Meta-Learning Approach. In Proc. of IEEE International Joint Conference on Neural Networks. IEEE Computer Society, 3728–3734. Jain, A. K., & Dubes, R. C. (1998) Algorithms for clustering data. Englewood Cliffs, NJ: Prentice-Hall. Bandyopadhyay, S., Maulik, U., Wang, Jason T. L. (2007) Analysis of biological data: a soft computing approach. Science, Engineering, and Biology Informatics, 3 ed. Toh Tuck Link, Singapore: World Scientific Publishing Co. Zhu, Z., Ong, Y.-S., & Kuo, J.-L. (2009) Feature Selection Using Single/Multi-Objective Memetic Frameworks, Multi-Objective Memetic Algorithms, in Series of Studies in Computational Intelligence, 171, 111-131. Duda, R. O., Hart, P. E., & Stork, D. G. (1981) Pattern classification and scene analysis, New York: Wiley. Gath, I. & Geva, A. (1989) Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (11), 773–781. Dave, R. N. (1989) Use of the adaptive fuzzy clustering algorithm to detect lines in digital images. Intell. Robots Comput. Vision VIII, 1192, 600–611. Man, Y. & Gath, I. (1994). Detection and separation of ring-shaped clusters using fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell., 16(8), 855–861. Su, M.-C., & Chou, C.-H. (2001). A modified version of the k-means algorithm with a distance based on cluster symmetry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 674–680.
Sarkar, A. & Maulik, U. (2013) Cancer Gene Expression Data Analysis Using Rough Based Symmetrical Clustering, in Handbook of Research on Computational Intelligence for Engineering, Science, and Business, Chapter: 27, Publisher: IGI Global, Editors: Siddhartha Bhattacharyya (RCC Institute of Information Technology, India) and Paramartha Dutta (Visva-Bharati University), 699-715. Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356. Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Kluwer Academic Publishers. Gawrys, M., & Sienkiewicz, J. (1994). Rsl–the rough set library version 2.0. ICS Research Report 27/94 Warsaw. Poland: Institute of Computer Science. W. U. of T. Pacheco, P. (1997). Parallel programming with MPI. Morgan Kaufmann. Young, K. Y. (2001) Validating clustering for gene expression data. Bioinformatics, 17, 309–318. Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum. Xie, X. L. & Beni, G. (1991). A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 841–847. Maulik, U. & Bandyopadhyay, S. (2001). Nonparametric genetic clustering: comparison of validity indices. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, 31(1), 120–125. Rousseeuw, P. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math, 20, 53–65. Hollander, M. & Wolfe, D. (1999). Nonparametric statistical methods. 2nd ed. USA: Wiley. Maulik, U. & Bandyopadhyay, S. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1650–1654. Cordasco, G., Scarano, V. & Rosenberg, A. L. (2007) Bounded-collision memory-mapping schemes for data structures with applications to parallel memories. IEEE Trans. Parallel Distrib. Syst., 18(7), 973–982.

Oleszkiewicz, J., Xiao, L. & Liu, Y. (2006) Effectively utilizing global cluster memory for large data-intensive parallel programs. IEEE Trans. Parallel Distrib. Syst., 17(1), 66–77. Salazar, R., Roepman, P., G. Capella, V. Moreno, I. Simon, C. Dreezen, A. Lopez-Doriga, C. Santos, C. Marijnen, J. Westerga, S. Bruin, D. Kerr, P. Kuppen, C. van de Velde, H. Morreau, L. V. Velthuysen, A. M. Glas, L. J. V. Veer & R. Tollenaar. (2011) Gene Expression Signature to Improve Prognosis Prediction of Stage II and III Colorectal Cancer. Journal of Clinical Oncology, 29(1), 17-24. Mackay, A., Weigelt, B., Grigoriadis, A., Kreike, B., Natrajan, R., A’Hern, R., Tan, D. S.P., Dowsett, M., Ashworth, A. & Reis-Filho, J. S. (2011) Microarray-Based Class Discovery for Molecular Classification of Breast Cancer: Analysis of Interobserver Agreement. JNCI Journal of The National Cancer Institute, 103(8), 662-673. Nishida, N., Nagahara, M., Sato, T., Mimori, K., Sudo, T., Tanaka, F., Shibata, K., Ishii, H., Sugihara, K., Doki, Y., & Mori, M. (2012). Human Cancer Biology: Microarray analysis of colorectal cancer stromal tissue reveals upregulation of two oncogenic microRNA clusters. Clinical Cancer Research, 1078. Published Online. Tyson, J.J., Baumann, W.T., Chen, C., Verdugo, A., Tavassoly, I., Wang, Y., Weiner, L.M., & Clarke, R. (2011) Dynamic models of estrogen signaling and cell fate in breast cancer cells. Nature Reviews Cancer, 11, 523-532. Gong, T., Xuan, J., Chen, L., Riggins, R.B., Li, H., Hoffman, E.P., Clarke, R., & Wang, Y. (2011). Motif-guided sparse decomposition of gene expression data for regulatory module identification. BMC Bioinformatics, 12(82), 16 pages as published on-line. Yu, G., Li, H., Ha, S., Shih, I.-M., Clarke, R., Hoffman, E.P., Madhavan, S., Xuan, J. & Wang, Y. (2011) PUGSVM: a caBIG™ analytical tool for multiclass gene selection and predictive classification. Bioinformatics, 27, 736-738. Lahti, L., Schäfer, M., Klein, H.-U.
, Bicciato, S., & Dugas, M. (2012) Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review. Briefings in Bioinformatics. Online. Ghorai, S., Mukherjee, A., Sengupta, S., & Dutta, P. K. (2011) Cancer Classification from Gene Expression Data by NPPC Ensemble, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(3). de Souto, M. CP, Costa, I. G., de Araujo, D. SA, Ludermir, T. B. & Schliep, A. (2008). Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(497). Glaab, E., Garibaldi, J., & Krasnogor, N. (2009) ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization, BMC Bioinformatics, 10(1), 358. Deb, K. (2001). Multi-objective optimization using evolutionary algorithms, John Wiley and Sons Ltd, England. Coello, C., Veldhuizen, D. V., & Lamont, G. (2002) Evolutionary algorithms for solving multi-objective problems, Kluwer Academic Publishers, Boston. Srinivas, N., & Deb, K. (1995) Multiobjective function optimization using nondominated sorting genetic algorithms, Evolutionary Computation, 2(3), 221-248. Zitzler, E., & Thiele, L. (1998) Multiobjective optimization using evolutionary algorithms - A comparative case study, Parallel Problem Solving From Nature, V, Springer-Verlag, Berlin, 292-301. Kirkpatrick, S., Gelatt, C. & Vecchi, M. (1983) Optimization by simulated annealing, Science, 220, 671-680. Suman, B. (2005) Study of the self-stopping PDMOSA and performance measure in multiobjective optimization, Computers and Chemical Engineering, 29(5), 1131-1147. Smith, K., Everson, R. & Fieldsend, J. (2004) Dominance measures for multi-objective simulated annealing, in Proceedings of the 2004 IEEE Congress on Evolutionary Computation (CEC'04), 23-30. Alfredo, G., Santana-Quintero, Luis V., Coello, Carlos Coello, Caballero, R. & Molina, J.
(2006) A new proposal for multi-objective optimization using differential evolution and rough sets theory in Proceedings of the (GECCO'06).

Smolinski, T. G., Buchanan, R., Boratyn, G. M., Milanova, M. & Prinz, A. A. (2012) Independent component analysis-motivated approach to classificatory decomposition of cortical evoked potentials in Proceedings of The Third Annual Conference of the MidSouth Computational Biology and Bioinformatics Society, BMC Bioinformatics, 7(Suppl 2)(S8), 1-12. Maulik, U. & Sarkar, A. (2010) Evolutionary Rough Parallel Multi-Objective Optimization Algorithm, Fundamenta Informaticae, 99(1), 13-27. Lin, W., Wu, Y., Mao, D., Yu, Y. (2008) Attribute Reduction of Rough Set Based on Particle Swarm Optimization with Immunity in Proceedings of the 2008 Second International Conference on Genetic and Evolutionary Computing WGEC ’08, IEEE Computer Society, Washington.

ADDITIONAL READING SECTION Sharan, R. (2003) CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics, 19, 1787–1799. DeRisi, J., Iyer, V. & Brown, P. (1997) Exploring the metabolic and genetic control of gene expression on a genome scale. Science, 282, 257–264. Chu, S. (1998) The transcriptional program of sporulation in budding yeast. Science, 202, 699–705. Cho, R. J. (1998) A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2, 65–73. Dhillon, I., Marcotte, E., & Roshan, U. (2003) Diametrical clustering for identifying anticorrelated gene clusters. Bioinformatics, 19, 1612–1619. Horn, D. & Axel, L. (2003). Novel clustering algorithm for microarray expression data in a truncated svd space. Bioinformatics, 19, 1110–1115. Bandyopadhyay, S., Mukhopadhyay, A. & Maulik, U. (2007). An improved algorithm for clustering gene expression data. Bioinformatics, 23(21), 2859–2865. Tou, J. T. & Gonzalez, R. C. (1974). Pattern recognition principles. Reading, MA: Addison-Wesley. Chen, Y. L. & Hu, H. L. (2006). An overlapping cluster algorithm to provide non-exhaustive clustering. Eur. J. Oper. Res., 173, 762–780. Bandyopadhyay, S. & Saha, S. (2007) GAPS: A clustering method using a new point symmetry-based distance measure. Pattern Recognition, 10(12), 3430–3451. Bandyopadhyay, S. & Saha, S. (2008). A point symmetry based clustering technique for automatic evolution of clusters. IEEE Transactions on Knowledge and Data Engineering, 20(11), 1–17. Kim, S. Y. (2001) Effect of data normalization on fuzzy clustering of DNA microarray data. BMC Bioinformatics, 17, 309–318. Maulik, U. & Sarkar, A. (2013) Searching remote homology with spectral clustering with symmetry in neighborhood cluster kernels, PLoS ONE, 8(2), e46468. Sarkar, A., Nikolski, M. & Maulik, U.
(2011) Spectral clustering on neighborhood kernels with modified symmetry for remote homology detection, in Proceedings of Second International Conference on Emerging Applications of Information Technology (EAIT), 269-272, Kolkata, IEEE-CS Explorer. Maulik, U. & Sarkar, A. (2012), Efficient parallel algorithm for pixel classification in remote sensing imagery, GeoInformatica 16(2), 391-407. Hvidsten, T. R., Laegreid, A. & Komorowski, J. (2003). Learning rule-based models of biological process from gene expression time profiles using gene ontology. Bioinformatics, 19( 9), 1116–1123. Kanungo, T. , Mount, D. , Netanyahu, N. , Piatko, C. , Silverman, R. & Wu, A. (2002) An efficient kmeans clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 881–892. Kalyanaraman, A. , Aluru, S. & Brendel, V. (2003) Space and time efficient parallel algorithms and software for EST clustering. IEEE Transactions on Parallel and Distributed Systems, 14(12), 1209–1221. Sarkar, A. & Maulik, U. (2009) Parallel Point symmetry Based Clustering for Gene Microarray Data, In the Proceedings of Seventh International Conference on Advances in Pattern Recognition-2009 (ICAPR, 2009), Kolkata, IEEE Computer Society, Conference Publishing Services (CPS), 351-354.

Ganesan, T., Elamvazuthi, I., Shaari, K.Z.K., Vasant, P. (2013) Multiobjective optimization of green sand mould system using chaotic differential evolution. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8160, 145–163. Elamvazuthi, I., Vasant, P., Ganesan, T. (2013) Hybrid Optimization Techniques for Optimization in a Fuzzy Environment. Intelligent Systems Reference Library, 38, 1025–1046. Jiménez, F., Sánchez, G., Vasant, P. (2013) A multi-objective evolutionary approach for fuzzy optimization in production planning. Journal of Intelligent and Fuzzy Systems, 25(2), 441–455. Sarkar, A. (2009) Scalable Modified Symmetry Based Clustering on Gene Expression Data Set, in the Proceedings of UGC Sponsored State Level Seminar on Playing God: Expanding Frontiers of Biotechnology, Kolkata. Rajko, S. & Aluru, S. (2004) Space and time optimal parallel sequence alignments. IEEE Trans. Parallel Distrib. Syst., 15(12), 1070–1081. Jiang, K., Thorsen, O., Peters, A. E., Smith, B. E. & Sosa, C. P. (2008). An efficient parallel implementation of the hidden markov methods for genomic sequence-search on a massively parallel system. IEEE Trans. Parallel Distrib. Syst., 19(1), 15–23. Sarkar, A. & Maulik, U. (2009) Parallel Clustering Technique Using Modified Symmetry Based Distance, In the Proceedings of 1st International Conference on Computer, Communication, Control and Information Technology (C3IT 2009), MacMillan Publishers India Ltd., 611-618. Liu, W. & Schmidt, B. (2006) Parallel pattern-based systems for computational biology: A case study. IEEE Trans. Parallel Distrib. Syst., 17(8), 750–763. Rajasekaran, S. (2005) Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst., 16(6), 497–502. Chen, L., Pan, Y. & Xu, X. hua. (2004) Scalable and efficient parallel algorithms for euclidean distance transform on the LARPBS model. IEEE Trans.
Parallel Distrib. Syst., 15(11), 975–982. Sarkar, A. & Maulik, U. (2009) An Efficient Parallel Point Symmetry Based Clustering Algorithm, In the Proceedings of IEEE National Conference on Computing and Communication Systems (CoCoSys-09), published by University Institute of Technology, Burdwan University, 131-136. Hollander, M. & Wolfe, D. (1999) Nonparametric statistical methods. 2nd ed. USA: Wiley. Wen, X. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proceedings National Academy Sciences USA, 95, 334–339. Iyer, V. R. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87. The Gene Ontology Consortium. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29. Shahrour, F. A. (2004). FatiGO: a web tool for finding significant associations to gene ontology terms with groups of genes. Bioinformatics, 20, 578–580. Mahoney, A. W., Podgorski, G. J., & Flann, N. S. (2012) Multiobjective Optimization Based-Approach for Discovering Novel Cancer Therapies, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 169-184. Abo-Sinnaa, M.A. & Husseinb, M.L. (1995) An Algorithm for Generating Efficient Solutions of Multiobjective Dynamic Programming Problems, European J. Operational Research, 80(1), 156-165. Alarcón, T. (2003) A Cellular Automaton Model for Tumour Growth in Inhomogeneous Environment, J. Theoretical Biology, 225(2), 257-274. Amyot, F., Small, A. & Gandjbakhche, A.H. (2006) Stochastic Modeling of Tumor Induced Angiogenesis in a Heterogeneous Medium, the Extracellular Matrix, Proc. 28th Ann. IEEE Int'l Conf. Eng. in Medicine and Biology Soc. (EMBS '06), 3146-3149. Arakelyan, L., Vainstein, V. & Agur, Z.
(2002) A Computer Algorithm Describing the Process of Vessel Formation and Maturation, and Its Use for Predicting the Effects of Anti-Angiogenic and Anti-Maturation Therapy on Vascular Tumor Growth, Angiogenesis, 5(3), 203-214.

Auerbacha, R., Lewis, R., Shinners, B., Kubai, L. & Akhtar, N. (2003) Angiogenesis Assays: A Critical Overview, Clinical Chemistry, 49(1), 32-40. Bauer, A.L., Jackson, T.L., & Jiang, Y. (2007) A Cell-Based Model Exhibiting Branching and Anastomosis during Tumor-Induced Angiogenesis, Biophysical J., 92(9), 3105-3121. Bauer, A.L., Jackson, T.L. & Jiang, Y. (2009) Topography of Extracellular Matrix Mediates Vascular Morphogenesis and Migration Speeds in Angiogenesis, PLoS Computational Biology, 5(7), e1000445. Boyle, P. & Levin, B. (2008) eds., World Cancer Report 2008, IARC Nonserial, ISBN-13 9789283204237, ISBN-10 9283204239. Byrne, H.M., Alarcon, T., Owen, M.R., Webb, S.D., & Maini, P.K. (2006) Modelling Aspects of Cancer Dynamics: A Review, Philosophical Transactions, Series A, Math., Physical, and Eng. Sciences, 364(1843), 1563-1578. Byrne, H.M., Owen, M.R., Alarcon, T.A., Murphy, J. & Maini, P.K. (2006) Modelling the Response of Vascular Tumours to Chemotherapy: A Multiscale Approach, Math. Models and Methods in Applied Sciences, 16(7S), 1219-1241. Carmeliet, P. & Jain, R.K. (2000) Angiogenesis in Cancer and Other Diseases, Nature, 407(6801), 249-257. Carraway, (1990) Generalized Dynamic Programming for Multicriteria Optimization, European J. Operational Research, 44(1), 95-104. Chaplain, M.A.J.A., McDougall, S.R.R. & Anderson, A.R.A.R. (2006) Mathematical Modeling of Tumor-Induced Angiogenesis, Ann. Rev. Biomedical Eng., 8, 233-257. Chattopadhyay, A. & Seeley, C. (1994) A Simulated Annealing Technique for Multiobjective Optimization of Intelligent Structures, Smart Materials and Structures, 3, 98-106. Deb, K. (2001) Multi-objective Optimization Using Evolutionary Algorithms. John Wiley and Sons. Ferrara, N., Hillan, K. & Novotny, W. (2005) Bevacizumab (Avastin), A Humanized Anti-VEGF Monoclonal Antibody for Cancer Therapy, Biochemical and Biophysical Research Comm., 333(2), 328-335. Finley, R.S.
(2002) New Directions in the Treatment of Cancer: Inhibition of Signal Transduction, J. Pharmacy Practice, 15(1), 5-16. Fonseca, C. & Fleming, P. (1995) An Overview of Evolutionary Algorithms in Multiobjective Optimization, Evolutionary Computation, 3( 1), 1-16.

KEY TERMS AND DEFINITIONS
Clustering: Assigning similar elements to the same group so that intra-cluster similarity is high and inter-cluster similarity is low.
Automatic classification: A classification method that needs no prior knowledge of the number of clusters.
Validity index: An index that estimates the compactness of the clusters and thereby helps to identify properly distinguishable clusters.
Gene expression data: Measurements of the process by which an encoded gene is converted to messenger RNA and then to protein.
Rough set: A set described by a pair of crisp sets, its lower and upper approximations, according to the rough set theory of Pawlak; the elements that lie between the two approximations form its boundary region.
Simulated Annealing: A probabilistic method for finding the global minimum of a cost function that may possess several local minima.
Multi-objective Optimization: A method for finding a set of representative Pareto optimal solutions and quantifying the trade-offs in satisfying the multiple objectives.
Cancer biomarker: A substance or process that indicates the presence of cancer in the body, including genetic, epigenetic, proteomic, glycomic and imaging biomarkers.
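As an illustration of the rough set definition, the following minimal Python sketch computes the lower and upper approximations of a target set on a toy decision table. The sample names, the discretized expression levels and the `approximations` helper are hypothetical; they are chosen only for illustration and are not taken from the ERMOO implementation or from the breast cancer dataset.

```python
from collections import defaultdict
from typing import Dict, Hashable, Set, Tuple


def approximations(
    descriptions: Dict[str, Tuple[Hashable, ...]], target: Set[str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the (lower, upper, boundary) approximations of `target`.

    Objects with identical attribute tuples are indiscernible and form
    one equivalence class.
    """
    classes = defaultdict(set)
    for obj, desc in descriptions.items():
        classes[desc].add(obj)

    lower: Set[str] = set()
    upper: Set[str] = set()
    for members in classes.values():
        if members <= target:   # class lies entirely inside the target
            lower |= members
        if members & target:    # class overlaps the target
            upper |= members
    return lower, upper, upper - lower


# Hypothetical samples described by two discretized expression levels.
samples = {
    "s1": ("high", "low"),
    "s2": ("high", "low"),
    "s3": ("low", "low"),
    "s4": ("low", "high"),
    "s5": ("low", "high"),
}
tumor = {"s1", "s2", "s4"}  # hypothetical target concept

lower, upper, boundary = approximations(samples, tumor)
accuracy = len(lower) / len(upper)  # accuracy of approximation
print(sorted(lower), sorted(upper), sorted(boundary), accuracy)
```

Here s4 and s5 are indiscernible but only s4 belongs to the target, so both fall in the boundary region. The lower approximation is {s1, s2}, the upper approximation is {s1, s2, s4, s5}, and the accuracy of approximation is 0.5.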