J Math Model Algor (2010) 9:275–289 DOI 10.1007/s10852-010-9140-2

Combining Gene Expression Profiles and Drug Activity Patterns Analysis: A Relational Clustering Approach

Elisabetta Fersini · E. Messina · F. Archetti · C. Manfredotti

Received: 13 January 2009 / Accepted: 9 March 2010 / Published online: 25 June 2010 © Springer Science+Business Media B.V. 2010

Abstract The combined analysis of tissue microarray and drug response datasets has the potential to reveal valuable knowledge about the relations between gene expression and drug activity patterns in tumor cells. However, the amount and complexity of biological data call for appropriate data mining models to extract interesting patterns and useful information. The ultimate goal of this paper is to define a model which, given the gene expression profile of a specific tumor tissue, could help in selecting a set of the most responsive drugs. This is accomplished through an integrated framework based on a constraint-based clustering algorithm, called Relational K-Means, which groups cell lines using drug response information while taking into account cell-to-cell relationships derived from their gene expression profiles.

Keywords Relational clustering · Pharmacogenomics · NCI60 dataset analysis

1 Introduction

Thanks to recent progress in biological experimental technologies, for example cDNA microarrays, large amounts of data have been collected. These data offer important opportunities to increase the knowledge of complex biological phenomena. An important research field concerns the discovery of embedded relationships among human cancer, gene expression profiles and drug activity. Highlighting these relationships is of crucial importance for several objectives, among others: elucidation of the mechanisms of cancer development, identification of new

E. Fersini (B) · E. Messina · F. Archetti · C. Manfredotti DISCo, University of Milano-Bicocca, Building U14, Viale Sarca 336/14, 20126, Milan, Italy e-mail: [email protected] F. Archetti Consorzio Milano Ricerche, Milan, Italy


molecular targets for anti-cancer drugs and definition of an individual, or at least targeted, therapy driven by a specific gene profile.

Finding evidence of relationships between gene expression profiles and drug responses has been the objective of several works. Scherf et al. [40] explore drug–gene correlations starting from a disjoint examination of gene expression and drug response relationships. To this purpose, a hierarchical clustering algorithm, with several similarity metrics, has been used to analyze: (1) cell-to-cell correlation on the basis of gene expression and drug activity profiles, (2) relationships between drug activity patterns and mechanism of action, (3) gene–drug correlation on the basis of gene expression and drug activity profiles. Gene expression profiles and drug response were first exploited by the Soft Topographic Vector Quantization clustering algorithm [17], while their combination is used to model probabilistic dependencies with respect to nine different cancer types. In Chang et al. [5] the existence of a common pattern in gene expression and drug activities of cell lines derived from the same tissue of origin has been investigated. In Dasgupta et al. [9] the relationships between genomic features and therapeutic response have been modeled using the Partial Least Squares (PLS) technique. The authors analyzed gene expression patterns to extract the underlying gene groups that correlate with the drug response.

The general result of the above papers, and of a wide set of related approaches, is that drug activity patterns are less related to the organ of origin than gene expression profile patterns are. The absence of a common pattern between gene expression profiles and drug response suggests that this heterogeneity might be partly due to the activity of genes related to drug sensitivity and resistance. This idea is supported by the fact that, as shown in Scherf et al. [40], several cell lines with a relatively high expression level of the gene ABCB1, which is known to regulate multidrug resistance, have been clustered in the same group. Consequently, this indicates that chemoresponse mechanisms are distributed across different tissues in the panel and that it should be possible to link drug activity to gene expression profiles.

Inspired by the growing consensus on the biological value of these links, or relationships, we have developed a novel relational clustering algorithm aimed at investigating whether drug response can be related to subsets of gene patterns. Computational results show that this relational algorithm yields better clusters than a traditional distance-based solution.

2 Relational Clustering

Several methods from the statistics and machine learning fields, such as hierarchical clustering [13], principal component analysis (PCA) [35], neural networks [25], and Bayesian networks [15, 21, 22], have been applied to high-throughput genomic analysis. In particular, several studies have been conducted with traditional distance-based clustering algorithms in order to identify common patterns in gene expression and drug activities of cell lines. These algorithms, such as K-Means [30], EM [11] and UPGMA [42], aim at partitioning data points into a pre-specified number of clusters through the minimization of a cost function related to a similarity/dissimilarity measure between the points. This similarity/dissimilarity measure is usually computed merely on the basis of a distance measure, without taking into account background information about pairs of instances that could constrain their cluster placement.

The conclusions drawn by previous investigations and the weakness of traditional distance-based algorithms lead us to define a constraint-based clustering algorithm, following the idea of introducing specific domain constraints derived from background knowledge, as presented in Wagstaff et al. [45], Bilenko et al. [3], Basu et al. [2] and Corsini et al. [8]. In our case, we cluster cell lines with respect to their drug activity response, taking into account constraints derived from their gene expression profiles.

Let $x$ be a cell line defined as a vector in $\mathbb{R}^{m+n}$, where $m$ and $n$ represent the dimensions of the gene expression profile space and of the drug activity feature space respectively. If we define $\Omega$ as:

$$\Omega = \{\, x \mid x = (x^G, x^D),\ x^G \in \mathbb{R}^m,\ x^D \in \mathbb{R}^n \,\} \tag{1}$$

then

$$\Omega^G = \{\, x^G \mid x = (x^G, x^D) \in \Omega \,\} \tag{2}$$

$$\Omega^D = \{\, x^D \mid x = (x^G, x^D) \in \Omega \,\} \tag{3}$$

where $\Omega^G$ and $\Omega^D$ denote the sets of cell lines represented through their gene expression profiles and their drug activity responses respectively. We assume that a relation between two instances can be either an affinity or a diversity relationship and can be represented by different degrees of intensity. The affinity and diversity constraints coming from gene expression profiles are derived by clustering the elements of $\Omega^G$. Elements in the same cluster are considered affine, and elements belonging to different clusters are considered diverse. The strength of these relationships depends on the distance between elements and determines the weights used to penalize the similarity/dissimilarity measure when clustering the elements of $\Omega^D$, i.e. with respect to drug response activity, as explained in more detail in the next sub-sections. The computational process of the proposed relational clustering algorithm is depicted in Fig. 1.

2.1 Relational Constraints Discovery as a Clustering Problem

In order to obtain relations between instances, we cluster the cell lines on the basis of their gene expression profiles using a distance measure based on the Pearson Correlation Coefficient [7]:

$$\mathrm{corr}_{ik} = \frac{E\big((x_i - \mu_i)(x_k - \mu_k)\big)}{\sigma_i \sigma_k} \tag{4}$$

where $E(x)$ represents the expected value of $x$, and $\mu$ and $\sigma$ are respectively the mean and the standard deviation of $x$. Since in our analysis the gene expression level and the drug activity of each cell line have been scaled to zero mean, we can replace $\mathrm{corr}_{ik}$ with the cosine measure

$$\cos_{ik} = \frac{\langle x_i, x_k \rangle}{\lVert x_i \rVert \, \lVert x_k \rVert} \tag{5}$$

and the distance between elements can be defined as:

$$\mathrm{dist}(x_i, x_k) = 1 - \cos_{ik} \tag{6}$$
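As an illustration of Eqs. 4 to 6, the following minimal sketch computes the pairwise distance matrix with NumPy. The function name `cosine_distance_matrix` and the random toy profiles are assumptions made for the example and are not part of the original study; the sketch also assumes that each profile is stored as a row and that no profile is identically zero.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise dist(x_i, x_k) = 1 - cos_ik (Eqs. 5-6) between the rows of X."""
    X = np.asarray(X, dtype=float)
    # Normalise every row to unit length (assumes no all-zero row).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    # Inner products of unit vectors are the cosines; clip guards rounding errors.
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - cos

# Toy data (hypothetical): 5 cell lines described by 8 features each.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(5, 8))
# Scale each profile to zero mean, as assumed in the text before Eq. 5.
profiles -= profiles.mean(axis=1, keepdims=True)

D = cosine_distance_matrix(profiles)
```

Because every row has zero mean, `D` coincides with one minus the Pearson correlation matrix of the rows, which is the equivalence the text relies on when replacing Eq. 4 with Eq. 5.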

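Given the elements $x_i^G \in \Omega^G$ and a set of clusters $C_j^G$ with $j = 1, \dots, J$, the clustering problem consists in assigning each element $x_i^G$ to a cluster $C_j^G$ such that the intra-cluster distance is minimized and the inter-cluster distance is maximized. If we define a matrix $Z$ of dimension $J \times |\Omega|$ as:

$$z_{ij} = \begin{cases} 1 & \text{if } x_i^G \in C_j^G \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

the problem can be formulated, in general terms, as:

$$\begin{aligned} \min\ & \sum_{j=1}^{J} \sum_{i,k=1}^{|\Omega|} \Big[ \mathrm{dist}(x_i^G, x_k^G)\, z_{ij} z_{kj} - \mathrm{dist}(x_i^G, x_k^G)\, z_{ij} (1 - z_{kj}) \Big] \\ \text{s.t.}\ & \sum_{j=1}^{J} z_{ij} = 1 \quad \forall i \\ & z_{ij} \in \{0, 1\} \end{aligned} \tag{8}$$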

This is a quadratic assignment problem, known to be NP-hard [16], and several heuristics have been proposed to solve it, as presented in Hand et al. [19]. In this paper we do not explicitly address the minimization defined in Eq. 8; instead, we solve the clustering problem using the well-known K-Means algorithm. The algorithm iteratively performs the following steps:

1. Set $J$ as the desired number of clusters
2. Randomly select in $\mathbb{R}^n$ $J$ representative elements, called centroids
3. Assign all data points to the centroid with the minimum distance
4. Update the centroids as the mean of the elements belonging to each cluster
5. Check the convergence condition
5.1. If all data points are assigned to the same cluster as in the previous iteration, then stop the process
5.2. Otherwise reapply the assignment process starting from step 3.

Fig. 1 The computational process of the proposed relational clustering framework

The obtained set of clusters leads us to define two matrices of relations, $R^U$ and $R^A$, that will be used in the subsequent clustering phase. $R^U$ is a $|\Omega| \times |\Omega|$ matrix whose elements $r^U_{ik}$ represent the weights of the diversity-links between elements belonging to different clusters, which will suggest in the following phase that two tissues should not be placed in the same cluster. More formally, $r^U_{ik}$ is defined as the distance between $x_i^G$ and $x_k^G$, computed by using Eq. 6, i.e.

$$r^U_{ik} = \begin{cases} \mathrm{dist}(x_i^G, x_k^G) & \text{if } x_i^G \in C_\alpha^G \text{ and } x_k^G \in C_\beta^G \neq C_\alpha^G \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

The matrix $R^A$, having the same dimension as $R^U$, represents the weights of the affinity-links between elements belonging to the same cluster and suggests that two elements should be placed in the same cluster. The element $r^A_{ik}$ is given by

$$r^A_{ik} = \begin{cases} \mathrm{dist}(x_i^G, x_k^G) & \text{if } x_i^G, x_k^G \in C_\alpha^G \\ 0 & \text{otherwise} \end{cases} \tag{10}$$
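The two relation matrices of Eqs. 9 and 10 can be sketched as follows, assuming that the gene-expression distance matrix and the cluster label of each cell line are already available. The function name `relation_matrices` and the three-element toy example are hypothetical and serve only as an illustration.

```python
import numpy as np

def relation_matrices(dist_g, labels):
    """Build the diversity matrix R^U (Eq. 9) and the affinity matrix R^A (Eq. 10).

    dist_g : (N, N) distances between gene expression profiles (Eq. 6)
    labels : length-N cluster labels from the gene-expression clustering
    """
    dist_g = np.asarray(dist_g, dtype=float)
    labels = np.asarray(labels)
    # same[i, k] is True when elements i and k fall in the same cluster.
    same = labels[:, None] == labels[None, :]
    r_u = np.where(same, 0.0, dist_g)  # non-zero only across clusters
    r_a = np.where(same, dist_g, 0.0)  # non-zero only within a cluster
    return r_u, r_a

# Toy example (hypothetical values): elements 0 and 1 share a cluster.
dist_g = np.array([[0.0, 0.2, 0.9],
                   [0.2, 0.0, 0.8],
                   [0.9, 0.8, 0.0]])
r_u, r_a = relation_matrices(dist_g, labels=[0, 0, 1])
```

By construction every pair carries exactly one kind of link, so `r_u + r_a` reproduces the original distance matrix.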

2.2 Relational K-Means

Now that the relations between elements have been obtained, we can proceed with the second phase of the algorithm, which aims at grouping cell lines using drug response information while taking into account the given cell-to-cell relationships. This clustering is aimed at minimizing the sum of the distances between elements, expressed by their drug activity response, penalized by a function that takes into account affinity and diversity constraints. Let $x_i^D \in \Omega^D$ be a given data point represented by its drug response features and $C_j^D$, with $j = 1, \dots, J$, be a set of clusters. Defining $Z$ as in Eq. 7, for cell lines expressed by their drug responses, i.e.

$$z_{ij} = \begin{cases} 1 & \text{if } x_i^D \in C_j^D \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

our Relational K-Means clustering aims at solving an optimization problem formulated as follows:

$$\begin{aligned} \min\ & \sum_{j=1}^{J} \sum_{i,k=1}^{|\Omega|} \Big[ \mathrm{dist}(x_i^D, x_k^D)\, z_{ij} z_{kj} - \mathrm{dist}(x_i^D, x_k^D)\, z_{ij} (1 - z_{kj}) \\ & \qquad + z_{ij} z_{kj}\, r^U_{ik}\, \mathrm{dist}(x_i^D, x_k^D) + z_{ij} (1 - z_{kj}) \big(1 - r^A_{ik}\big)\, \mathrm{dist}(x_i^D, x_k^D) \Big] \\ \text{s.t.}\ & \sum_{j=1}^{J} z_{ij} = 1 \quad \forall i \\ & z_{ij} \in \{0, 1\} \end{aligned} \tag{12}$$

In this way, if a diversity-link (or an affinity-link) is not preserved, the objective function is penalized according to the weight $r^U_{ik}$ (or $r^A_{ik}$) and the distance between cell lines $i$ and $k$ represented in terms of their drug response features, $\mathrm{dist}(x_i^D, x_k^D)$. Only the distance function used within K-Means needs to be modified in order to account for the constraint violations in Eq. 12.
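To make the objective of Eq. 12 concrete, the sketch below evaluates it for a given hard assignment of cell lines to clusters. It is only an illustration under stated assumptions: the function name `relational_objective` and the toy matrices are hypothetical, and the code evaluates the objective term by term rather than reproducing the authors' optimization procedure. For hard assignments, the sum over $j$ of $z_{ij} z_{kj}$ equals 1 when $i$ and $k$ share a cluster and 0 otherwise, which is what the `same` mask encodes.

```python
import numpy as np

def relational_objective(dist_d, labels, r_u, r_a):
    """Value of the Eq. 12 objective for a hard cluster assignment.

    dist_d : (N, N) distances between drug response profiles
    labels : length-N cluster labels in the drug-response clustering
    r_u, r_a : (N, N) diversity and affinity weights (Eqs. 9-10)
    """
    dist_d = np.asarray(dist_d, dtype=float)
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)  # sum_j z_ij z_kj
    diff = 1.0 - same                                          # sum_j z_ij (1 - z_kj)
    intra = same * dist_d                          # first term of Eq. 12
    inter = diff * dist_d                          # second term (subtracted)
    diversity_term = same * r_u * dist_d           # third term
    affinity_term = diff * (1.0 - r_a) * dist_d    # fourth term
    return float(np.sum(intra - inter + diversity_term + affinity_term))

# Toy example (hypothetical values): gene clustering put elements 0 and 1 together.
dist_d = np.array([[0.0, 0.1, 0.7],
                   [0.1, 0.0, 0.6],
                   [0.7, 0.6, 0.0]])
r_u = np.array([[0.0, 0.0, 0.9],
                [0.0, 0.0, 0.8],
                [0.9, 0.8, 0.0]])
r_a = np.array([[0.0, 0.2, 0.0],
                [0.2, 0.0, 0.0],
                [0.0, 0.0, 0.0]])

obj_consistent = relational_objective(dist_d, [0, 0, 1], r_u, r_a)
obj_violating = relational_objective(dist_d, [0, 1, 1], r_u, r_a)
```

In this toy setting the assignment that respects the gene-derived links (`[0, 0, 1]`) obtains a lower objective value than the one that separates elements 0 and 1 and merges elements 1 and 2.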

3 Experimental Settings

3.1 The NCI60 Dataset

The NCI60 dataset, built at the National Cancer Institute (USA), is composed of 60 cell lines from nine different human cancer tissues: colorectal, renal, ovarian, breast, prostate, lung, central nervous system, leukemia and melanoma. The dataset composition is given in Table 1. For the experimental validation we used two NCI60 datasets, Staunton [43] and Ross [40], obtained by using two different DNA microarray technologies: high-density Affymetrix and spotted cDNA arrays. The NCI60 data consist of two matrices $T$ and $A$ representing, for each cell line, its gene expression profile and its in vitro drug sensitivity response. Notice that the columns of $T$ and $A$ correspond to the instances of $\Omega^G$ and $\Omega^D$ defined in Eqs. 2 and 3.

The Staunton dataset represents each cell line by using 7,129 genes and 5,084 drug compounds. The dataset was downloaded from the Whitehead Institute Cancer Genomics supplemental data to the paper by Staunton et al.¹ The Ross dataset contains the expression of 1,375 genes, characterized by a strong pattern of variation among cell lines and having fewer than five missing values, and the drug-response values of 1,400 chemical compounds. This dataset was retrieved from the NCI Genomics and Bioinformatics Group Datasets resource.² In both cases drug activities are obtained by monitoring the changes in total cellular protein after 48 h of drug treatment, with compounds tested one at a time and independently.

1 http://www-genome.wi.mit.edu/mpr/NCI60/
2 http://discover.nci.nih.gov/datasetsNature2000.jsp

Table 1 NCI60 dataset composition

| Tumor category | Number of cell lines |
|---|---|
| Colorectal | 7 |
| Renal | 8 |
| Ovarian | 6 |
| Breast | 8 |
| Prostate | 2 |
| Lung | 9 |
| Central nervous system | 6 |
| Leukemias | 6 |
| Melanomas | 8 |

3.2 Evaluation Measures

In order to evaluate the quality of the proposed relational clustering algorithm, we computed the widely used Average Pearson Correlation Coefficient as internal evaluation measure, defined as:

$$P = \sum_{j=1}^{J} \frac{n_j}{|\Omega|} \sum_{i=1}^{|\Omega|} \sum_{k=1}^{|\Omega|} \frac{\mathrm{corr}_{ik}\, z_{ij} z_{kj}}{|n_j|^2} \tag{13}$$

where $J$ is the number of clusters and $n_j$ is the cardinality of cluster $j$. We estimate $P$ with respect to two different sets of features: in one case $P$ is computed by considering the Pearson correlation between instances $i$ and $k$ according to their gene expression profiles ($P^G$), and in the other case by using their drug response/sensitivity profiles ($P^D$).

As external evaluation measures, we considered F-Measure and Entropy. The traditional F-Measure, which combines the Precision and Recall measures, is defined as follows. Given a set of class labels $L$, corresponding to the tumor categories reported in Table 1, we compute the Precision and Recall for each class label $l \in L$ with respect to the cluster $j$ as:

$$\mathrm{Precision}(l, j) = \frac{n_{lj}}{n_j} \tag{14}$$

$$\mathrm{Recall}(l, j) = \frac{n_{lj}}{n_l} \tag{15}$$

where

– $n_{lj}$ is the number of elements belonging to the class label $l$ and located in cluster $j$,
– $n_j$ represents the cardinality of cluster $j$, and
– $n_l$ is the number of cell lines with class label $l$.

The F-Measure for each class label $l \in L$ is computed as the harmonic mean of Precision and Recall:

$$F(l, j) = \frac{2 \times \mathrm{Recall}(l, j) \times \mathrm{Precision}(l, j)}{\mathrm{Recall}(l, j) + \mathrm{Precision}(l, j)} \tag{16}$$

The overall quality of the resulting clustering solution is given by the scalar $F^*$, computed as the weighted sum over the maximum F-Measures obtained for each class $l \in L$:

$$F^* = \sum_{l} \frac{n_l}{|\Omega|} \max_{j \in J} \{F(l, j)\} \tag{17}$$

The second external evaluation measure used in the experimental phase is the Entropy, which evaluates the purity of the clusters with respect to the given class labels. To compute the entropy of a set of clusters, we first calculate the class distribution of the objects in each cluster, i.e. for each cluster $j$ we compute the probability $p_{lj}$ that a member of cluster $j$ belongs to class $l$. Given this class distribution, the entropy of cluster $j$ is calculated using the standard entropy formula:

$$E_j = -\sum_{l} p_{lj} \log(p_{lj}) \tag{18}$$

The total Entropy for a set of clusters, represented by a scalar $E^*$, is computed as the weighted sum of the entropies of each cluster $j$:

$$E^* = \sum_{j=1}^{J} \frac{n_j}{|\Omega|} E_j \tag{19}$$

where $n_j$ is the cardinality of cluster $j$.
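The two external measures can be sketched in a few lines of NumPy. The function name `f_star_and_entropy` and the four-element toy labels are hypothetical; the sketch additionally assumes the natural logarithm in Eq. 18 (the text does not state the base) and weights each cluster by its cardinality in the total entropy.

```python
import numpy as np

def f_star_and_entropy(classes, clusters):
    """Overall F-Measure F* (Eq. 17) and total Entropy E* (Eq. 19)."""
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    n = len(classes)
    class_ids = np.unique(classes)
    cluster_ids = np.unique(clusters)
    # counts[l, j] = n_lj: elements of class l located in cluster j.
    counts = np.array([[np.sum((classes == l) & (clusters == j))
                        for j in cluster_ids] for l in class_ids], dtype=float)
    n_l = counts.sum(axis=1)  # class sizes
    n_j = counts.sum(axis=0)  # cluster sizes
    precision = counts / n_j[None, :]
    recall = counts / n_l[:, None]
    denom = precision + recall
    # Harmonic mean (Eq. 16); defined as 0 where precision + recall is 0.
    f = np.divide(2 * precision * recall, denom,
                  out=np.zeros_like(counts), where=denom > 0)
    f_star = float(np.sum(n_l / n * f.max(axis=1)))
    # p[l, j]: probability that a member of cluster j belongs to class l.
    p = counts / n_j[None, :]
    log_p = np.log(p, out=np.zeros_like(p), where=p > 0)  # natural log (assumption)
    e_j = -np.sum(p * log_p, axis=0)
    e_star = float(np.sum(n_j / n * e_j))
    return f_star, e_star

# Toy labels (hypothetical): two tumor categories, two clusters.
pure = f_star_and_entropy(["a", "a", "b", "b"], [0, 0, 1, 1])
mixed = f_star_and_entropy(["a", "a", "b", "b"], [0, 1, 0, 1])
```

A clustering that matches the class labels exactly yields the best values of both measures (high $F^*$, low $E^*$), while a clustering that splits every class evenly across the clusters scores worse on both.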

4 Experimental Results

To demonstrate the efficacy of the proposed algorithm, we applied the Relational K-Means to the NCI60 dataset, comparing our results to those obtained by a traditional K-Means algorithm. In Tables 2 and 3 we report the results obtained over the Ross dataset by the traditional K-Means and by the relational algorithm respectively. In Tables 4 and 5 the results over the Staunton dataset are reported. Since both the traditional K-Means and our approach depend on the initial choice of the representative elements of each cluster (centroids), we show a set of statistics computed over 500 independent runs. In particular, we report: minimum, maximum, average, standard deviation and confidence interval (confidence level at 95%).

We can see that all the average performance indices of the relational approach improve on the average performance of the traditional K-Means approach. In particular, with respect to the external quality measures, our approach produces a clustering solution that is characterized by more purity and discriminative power than the traditional technique. Indeed, the introduction of relational constraints leads to clusters that are more homogeneous, both in terms of gene expression and drug activity, than the clusters obtained by the traditional K-Means algorithm. Moreover, the clusters obtained by applying the proposed relational approach seem to be consistent with the well-known anticancer drug categories (toxics, alkylating and antimetabolite agents).

We would like to point out that the clusters obtained by our approach could be used to suggest the most responsive drugs for a “new” cell line given its gene

Table 2 K-Means performance over the Ross dataset

| K-Means | Min | Max | Avg | Std. dev. | Conf. int. |
|---|---|---|---|---|---|
| $P^G$ | 0.412 | 0.506 | 0.474 | 0.021 | 0.468–0.479 |
| $P^D$ | 0.800 | 0.844 | 0.829 | 0.002 | 0.826–0.832 |
| $F^*$ | 0.398 | 0.577 | 0.500 | 0.048 | 0.487–0.514 |
| $E^*$ | 0.668 | 1.366 | 1.014 | 0.151 | 0.9722–1.056 |

expression profile. This can be done by first assigning the new cell line to the cluster whose representative element is the most similar with respect to the gene expression profile, and then ranking the drugs by the number of times they show a high response in the samples belonging to the cluster. An example is depicted in Fig. 2, in which the cell lines of a cluster are represented by columns whose upper part represents the drug response and whose lower part represents the gene expression profile. Each drug is represented in terms of high, medium and low levels of response with respect to its range of activity variation. The gene profile of each cell line is colored to reflect the mean-adjusted expression level of the gene (row) and cell line (column). If a cell line were assigned to this cluster, the drugs reported on the right would be those suggested by the ranking based on a frequentist intra-cluster approach.

From a biological point of view, a first evaluation has been performed by further analysing the results through a feature selection procedure, in order to highlight the functional role of the most informative genes characterizing each cluster. Indeed, since clusters represent sets of cell lines that show a similar response to anti-cancer therapy while also taking into account genomic information, by performing a feature

Table 3 Performance of the relational approach over the Ross dataset

| Relational | Min | Max | Avg | Std. dev. | Conf. int. |
|---|---|---|---|---|---|
| $P^G$ | 0.464 | 0.616 | 0.503 | 0.020 | 0.497–0.510 |
| $P^D$ | 0.813 | 0.844 | 0.830 | 0.006 | 0.828–0.832 |
| $F^*$ | 0.485 | 0.656 | 0.571 | 0.043 | 0.559–0.583 |
| $E^*$ | 0.780 | 1.222 | 0.961 | 0.094 | 0.934–0.987 |

Table 4 K-Means performance over the Staunton dataset

| K-Means | Min | Max | Avg | Std. dev. | Conf. int. |
|---|---|---|---|---|---|
| $P^G$ | 0.820 | 0.864 | 0.847 | 0.008 | 0.845–0.850 |
| $P^D$ | 0.770 | 0.835 | 0.804 | 0.014 | 0.800–0.807 |
| $F^*$ | 0.373 | 0.477 | 0.419 | 0.025 | 0.413–0.426 |
| $E^*$ | 1.152 | 1.630 | 1.332 | 0.097 | 1.305–1.360 |

Table 5 Performance of the relational approach over the Staunton dataset

| Relational | Min | Max | Avg | Std. dev. | Conf. int. |
|---|---|---|---|---|---|
| $P^G$ | 0.834 | 0.879 | 0.865 | 0.007 | 0.863–0.867 |
| $P^D$ | 0.742 | 0.854 | 0.805 | 0.025 | 0.798–0.812 |
| $F^*$ | 0.356 | 0.486 | 0.425 | 0.034 | 0.415–0.434 |
| $E^*$ | 1.039 | 1.669 | 1.289 | 0.153 | 1.255–1.340 |

Table 6 Gene selection based on relational clustering

| No. | Gene ID | Gene name | Biological process | Role (referred literature) |
|---|---|---|---|---|
| 1 | 471220 | RAI14 | Cellular growth | Retinoic acid plays a critical role in development, cellular growth, and differentiation and induces the expression of a variety of genes. Deletion and mutation analysis indicated that the identified nuclear localization sequence is indispensable for nuclear targeting [27]. |
| 2 | 248589 | SGK1 | Apoptosis | SGK1 is a downstream target of cell survival and that it is primarily regulated at the level of transcription [1, 32]. |
| 3 | 485677 | SERPINE2 | Cell differentiation; Regulation of cell migration | The study of SERPINE2 suggests that this protein may play an important part in the process of metastasis [4]. SERPINE2 has been shown to inhibit tumor cell-mediated extracellular matrix destruction [10]. |
| 4 | 360021 | ADCY7 | Activation/inhibition of adenylate cyclase activity by G-protein signaling pathway; Activation of protein kinase A activity; Cellular growth | ADCY7 is a specific downstream effector of the G(12/13) for regulating apoptosis [24]. |
| 5 | 21822 | CD24 | Apoptosis; Nuclear changes; Cell activation; Cell migration | CD24 has emerged as a new oncogene and metastasis promoter [28]. CD24 appears to be highly expressed in a large variety of human cancers and to contribute to the acceleration of tumor growth and metastases [37]. |
| 6 | 325120 | TIMP-2 | Negative regulation of cell proliferation; regulation of MAPKKK cascade | TIMP-2 is shown to strongly inhibit cancer invasion and metastasis [14]. |
| 7 | 487878 | SPARC | Regulation of cell proliferation | SPARC is a secreted protein, acidic and rich in cysteines. It is a matrix associated protein that elicits changes in cell shape, inhibits cell cycle progression, and influences the synthesis of extracellular matrix. Clinical evidence indicates that SPARC expression correlates with tumor progression [6]. |
| 8 | 415605 | EPS8 | Cell proliferation | EPS8, which activates Rac1, is part of an integrated signaling pathway modulating integrin-dependent tumour cell motility and therefore is a possible therapeutic target [46]. |
| 9 | 486844 | GJA1 | Apoptosis | GJA1 is involved in several kinds of tumor, as breast, lung, prostate and ovarian [29, 36, 44]. |
| 10 | 416227 | MUS81 | DNA reparation | Mus81 is involved in the formation of double-strand DNA breaks in response to the inhibition of replication [47]. |

Table 7 Gene selection based on traditional K-Means clustering

| No. | Gene ID | Gene name | Biological process | Role (referred literature) |
|---|---|---|---|---|
| 1 | 469841 | PON2 | Aromatic compound catabolic process | The analysis of PON2 genotypes distribution showed higher percentage of genotype among coronary artery disease positive compared with coronary artery disease negative [23]; PON2 plays a role in the development of atherosclerosis in vivo [33]. |
| 2 | 487811 | PAM | Cellular process; peptide metabolic process; protein modification process | The 37-amino acid peptide called amylin is a major component of the islet amyloid deposited in the pancreases of persons with type 2 diabetes mellitus [34]. |
| 3 | 486070 | S100A6 | Positive regulation of fibroblast proliferation | The presence of S100A6 results in higher p53 transcriptional activity which is also reflected by higher cell susceptibility to apoptosis evoked by hydrogen peroxide [41]. |
| 4 | 485209 | RCN1 | No evidence | RCN1 play a potential role in haemostasis and thrombosis... [20] |
| 5 | 310521 | DNAJA3 | Apoptosis; cell death; negative regulation of cell proliferation | It is an important cell death regulator and could exert tumor suppressor activity [12]. |
| 6 | 366242 | KCTD3 | Potassium ion transport | The antigens KCTD3 reacts to the antibody SEREX in patients with renal-cell carcinoma [39]. |
| 7 | 487396 | No evidence | No evidence | No evidence |
| 8 | 324386 | GNAI1 | Signal transduction; inhibition of adenylate cyclase activity by G-protein signaling pathway | GNAI1 signaling pathway play an essential role in the activation of Akt, which is an important signaling intermediate in regulating cell survival and cell cycle progression [26]. |
| 9 | 488721 | IGFBP7 | Regulation of cell growth; negative regulation of cell proliferation | IGFBP7 strongly suppresses tumor growth by an insulin-independent or insulin-like growth factor-independent mechanism [38]. |
| 10 | 114348 | C2orf56 | No evidence | No evidence |

selection activity we should be able to identify the subset of genes that could possibly regulate the cells' response behaviour. In order to validate the hypothesis that the obtained groups of cell lines embed useful information for the pharmacology of cancer, we applied the feature selection technique presented in Hall [18], known as Correlation-based Feature Selection (CFS), also called CFS subset evaluation. This is a heuristic that takes into account the usefulness of individual features (genes) for predicting a given class label (drug response pattern). The assumption on which the heuristic is based is “Good feature subsets contain features highly correlated with the class, yet uncorrelated with each other.” We performed the CFS feature selection at each run and we considered as informative genes the 10 features selected most frequently over the 500 runs. We then searched for the biological functions to


Fig. 2 An example of a cluster obtained by the proposed approach. At the top of the cluster we report the drug activity response and at the bottom the gene expression profile. For a new cell line associated with this cluster, our method with a frequentist approach will suggest the use of a 6H-Pyrido-based compound

which these genes are associated by accessing the Entrez Gene Database,³ the NCBI (National Center for Biotechnology Information) database for gene-specific information [31]. In Tables 6 and 7 we report the 10 genes selected starting from the clustering results obtained on the Ross dataset. Column one reports the NCI-60 ID of each gene, column two indicates the gene name through the Entrez Gene accession, column three contains a description of the gene function, and finally column four reports a description, extracted from the literature, of the role of the gene. Among the top 10 genes indicated as informative by the K-Means approach, only five are actually considered relevant to cancer development, while the 10 genes highlighted by our clustering approach are all recognized as relevant.

5 Conclusion

In this paper, the NCI60 dataset has been analyzed for the molecular pharmacology of cancer. The 60 cell lines were clustered by using a Relational K-Means algorithm. The algorithm aims at defining groups of cell lines by using drug response information and taking into account, through relational constraints, cell-to-cell relationships defined by the similarity of their gene expression profiles. We evaluated our approach in terms of Average Pearson Correlation Coefficient, F-Measure and Entropy.

3 http://www.ncbi.nlm.nih.gov/gene/


The experimental results show that the proposed method outperforms the traditional distance-based K-Means technique. In particular, our clustering algorithm leads to clusters that are homogeneous both in terms of gene expression and drug activity profiles. Moreover, our approach produces a clustering solution that is characterized by more purity and discriminative power than the traditional technique. The defined model could help in selecting a set of the most responsive drugs given the gene expression profile of a specific tumor tissue. Our ongoing work aims (1) to include further kinds of relations in the proposed approach and (2) to develop probabilistic models able to predict drug responses given an expression level profile.

Author Disclosure Statement

No competing financial interests exist.

References 1. Amato, R., Menniti, M., Agosti, V., Boito, R., Costa, N., Bond, H.M., Barbieri, V., Tagliaferri, P., Venuta, S., Perrotti, N.: IL-2 signals through Sgk1 and inhibits proliferation and apoptosis in kidney cancer cells. J. Mol. Med. 85(7), 707–721 (2007) 2. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised clustering. In: Kim, W., Kohavi, R., Gehrke, J., DuMouchel, W. (eds.) Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 22–25 August 2004, Seattle, pp. 59–68. ACM, New York (2004) 3. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semisupervised clustering. In: Greiner, R., Schuurmans, D. (eds.) Proceedings of the Twenty-First International Conference on Machine Learning 2004, Alberta, pp. 81–88. ACM, New York (2004) 4. Buchholz, M., Biebl, A., Neee, A., Wagner, M., Iwamura, T., Leder, G., Adler, G., Gress, T.: SERPINE2 (protease nexin I) promotes extracellular matrix production and local invasion of pancreatic tumors in vivo. Cancer Res. 63, 4945–4951 (2003) 5. Chang, J.H., Hwang, K.B., Zhang, B.T.: Analysis of gene expression profiles and drug activity patterns by clustering and Bayesian network learning. In: Lin, S.M., Johnson, K.F. (eds.) Methods of Microarray Data Analysis II, chapter 11, pp. 169–184. Kluwer Academic, Dordrecht (2002) 6. Clark, C.J., Sage, E.H.: A prototypic matricellular protein in the tumor microenvironment where there’s SPARC, there’s fire. J. Cell. Biochem. 104(3), 721–732 (2008) 7. Cohen, J., Cohen, P., West, S.G., Aiken, L.S.: Applied Multiple Regression/Correlation Analysis for the Behavioural Science. Lawrence Erlbaum Associates, Hillsdale (2003) 8. Corsini, P., Lazzerini, B., Marcelloni, F.: A new fuzzy relational clustering algorithm based on the fuzzy C-means algorithm. J. Soft Comput. 9, 439–447 (2005) 9. 
Dasgupta, N., Lin, S.M., Carin, L.: Modeling pharmacogenomics of the NCI-60 anticancer data set: utilizing kernel PLS to correlate the microarray data to therapeutic responses. In: Lin, S.M., Johnson, K.F. (eds.) Methods of Microarray Data Analysis II, chapter 10, pp. 151–167. Kluwer Academic, Dordrecht (2002) 10. Del Rio, M., Molina, F., Bascoul-Mollevi, C., Copois, V., Bibeau, F., Chalbos, P., Bareil, C., Kramar, A., Salvetat, N., Fraslon, C., Conseiller, E., Granci, V., Leblanc, B., Pau, B., Martineau, P., Ychou, M.: Gene expression signature in advanced colorectal cancer patients select drugs and response for the use of leucovorin, fluorouracil, and irinotecan. J. Clin. Oncol. 25(7), 773–780 (2007) 11. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977) 12. Edwards, K.M., Mnger, K.: Depletion of physiological levels of the human TID1 protein renders cancer cell lines resistant to apoptosis mediated by multiple exogenous stimuli. Oncogene 23(52), 8419–8431 (2004)

288

J Math Model Algor (2010) 9:275–289

13. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. In: Schekman, R. (ed.) Proceedings of the National Academy of Sciences of the United States of America, vol. 95, pp. 14863–14868 (1998)
14. Eren, B., Sar, M., Oz, B., Dincbas, F.H.: MMP-2, TIMP-2 and CD44v6 expression in non-small-cell lung carcinomas. Ann. Acad. Med. Singap. 37(1), 32–39 (2008)
15. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. In: Shamir, R., Miyano, S., Istrail, S., Pevzner, P., Waterman, M. (eds.) Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 8–11 April 2000, Tokyo, pp. 127–135 (2000)
16. Gonzalez, T.F.: Clustering to minimize the maximum inter-cluster distance. J. Theor. Comput. Sci. 38, 293–306 (1985)
17. Graepel, T., Burger, M., Obermayer, K.: Self-organizing maps: generalizations and new optimization techniques. J. Neurocomput. 21, 173–190 (1998)
18. Hall, M.A.: Correlation-based feature reduction for discrete and numeric class machine learning. In: Proceedings of the 17th International Conference on Machine Learning (2000)
19. Hand, D.J., Mannila, H., Smyth, P.: Principles of Data Mining. MIT, Cambridge (2001)
20. Hansen, G.A., Vorum, H., Jacobsen, C., Honore, B.: Calumenin but not reticulocalbin forms a Ca2+-dependent complex with thrombospondin-1. A potential role in haemostasis and thrombosis. Mol. Cell. Biochem. 320, 25–33 (2009)
21. Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A.: Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. In: Altman, R.B., Dunker, A.K., Hunker, L., Lauderdale, K., Klein, T.E.D. (eds.) Proceedings of Pacific Symposium on Biocomputing, 3–7 January 2001, Hawaii, pp. 422–433 (2001)
22. Hwang, K.B., Cho, D.Y., Park, S.W., Kim, S.D., Zhang, B.T.: Applying machine learning techniques to analysis of gene expression data: cancer diagnosis. In: Lin, S.M., Johnson, K.F. (eds.) Methods of Microarray Data Analysis, pp. 167–182. Kluwer Academic, Dordrecht (2001)
23. Jalilian, A., Javadi, E., Akrami, M., Fakhrzadeh, H., Heshmat, R., Rahmani, M., Bandarian, F.: Association of cys 311 ser polymorphism of paraoxonase-2 gene with the risk of coronary artery disease. Arch. Iran. Med. 11(5), 544–549 (2008)
24. Jiang, L.I., Collins, J., Davis, R., Fraser, I.D., Sternweis, P.C.: Regulation of cAMP responses by the G12/13 pathway converges on adenylyl cyclase VII. J. Biol. Chem. 283(34), 23429–23439 (2008)
25. Khan, J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., Meltzer, P.S.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. J. Nat. Med. 7, 673–679 (2001)
26. Kim, S., Jin, J., Kunapuli, S.P.: Akt activation in platelets depends on Gi signaling pathways. J. Biol. Chem. 279(6), 4186–4195 (2004)
27. Kutty, R.K., Chen, S., Samuel, W., Vijayasarathy, C., Duncan, T., Tsai, J.Y., Fariss, R.N., Carper, D., Jaworski, C., Wiggert, B.: Cell density-dependent nuclear/cytoplasmic localization of NORPEG (RAI14) protein. Mol. Cell Biol. Res. Commun. 345(4), 1333–1341 (2006)
28. Lee, J.H., Kim, S.H., Lee, E.S., Kim, Y.S.: CD24 overexpression in cancer development and progression: a meta-analysis. Oncol. Rep. 22(5), 1149–1156 (2009)
29. Li, Z., Zhou, Z., Welch, D.R., Donahue, H.J.: Expressing connexin 43 in breast cancer cells reduces their metastasis to lungs. Clin. Exp. Metastasis 25(8), 893–901 (2008)
30. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: LeCam, L.M., Neyman, N. (eds.) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press, Berkeley (1967)
31. Maglott, D., Ostell, J., Pruitt, K.D., Tatusova, T.: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 35, D26–D31 (2007)
32. Mikosz, C.A., Brickley, D.R., Sharkey, M.S., Moran, T.W., Conzen, S.D.: Glucocorticoid receptor-mediated protection from apoptosis is associated with induction of the serine/threonine survival kinase gene, sgk-1. J. Biol. Chem. 276, 16649–16654 (2001)
33. Ng, C.J., Bourquard, N., Grijalva, V., Hama, S., Shih, D.M., Navab, M., Fogelman, A.M., Lusis, A.J., Young, S., Reddy, S.T.: Paraoxonase-2 deficiency aggravates atherosclerosis in mice despite lower apolipoprotein-B-containing lipoproteins: anti-atherogenic role for paraoxonase-2. J. Biol. Chem. 281(40), 29491–29500 (2006)


34. Roberts, A.N., Leighton, B., Todd, J.A., Cockburn, D., Schofield, P.N., Sutton, R., Holt, S., Boyd, Y., Day, A.J., Foot, E.A.: Molecular and functional characterization of amylin, a peptide associated with type 2 diabetes mellitus. Proc. Natl. Acad. Sci. U.S.A. 86(24), 9662–9666 (1989)
35. Raychaudhuri, S., Stuart, J.M., Altman, R.B.: Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Altman, R.B., Dunker, A.K., Hunker, L., Lauderdale, K., Klein, T.E.D. (eds.) Proceedings of Pacific Symposium on Biocomputing, 4–9 January 2000, Hawaii, pp. 452–463 (2000)
36. Qin, H., Shao, Q., Curtis, H., Galipeau, J., Belliveau, D.J., Wang, T., Alaoui-Jamali, M.A., Laird, D.W.: Retroviral delivery of connexin genes to human breast tumor cells inhibits in vivo tumor growth by a mechanism that is independent of significant gap junctional intercellular communication. J. Biol. Chem. 277(32), 29132–29138 (2002)
37. Sagiv, E., Arber, N.: The novel oncogene CD24 and its arising role in the carcinogenesis of the GI tract: from research to therapy. Expert Rev. Gastroenterol. Hepatol. 2(1), 125–133 (2008)
38. Sato, Y., Chen, Z., Miyazaki, K.: Strong suppression of tumor growth by insulin-like growth factor-binding protein-related protein 1/tumor-derived cell adhesion factor/mac25. Cancer Sci. 98(7), 1055–1063 (2007)
39. Scanlan, M.J., Gordan, J.D., Williamson, B., Stockert, E., Bander, N.H., Jongeneel, V., Gure, A.O., Jäger, D., Jäger, E., Knuth, A., Chen, Y.T., Old, L.J.: Antigens recognized by autologous antibody in patients with renal-cell carcinoma. Int. J. Cancer 83(4), 456–464 (1999)
40. Scherf, U., Ross, D.T., Waltham, M., Smith, L.H., Lee, J.K., Tanabe, L., Kohn, K.W., Reinhold, W.C., Myers, T.G., Andrews, D.T., Scudiero, D.A., Eisen, M.B., Sausville, E.A., Pommier, Y., Botstein, D., Brown, P.O., Weinstein, J.N.: A gene expression database for the molecular pharmacology of cancer. J. Nat. Genet. 66, 236–244 (2000)
41. Slomnicki, L.P., Nawrot, B., Leśniak, W.: S100A6 binds p53 and affects its activity. Int. J. Biochem. Cell. Biol. 41(4), 784–790 (2009)
42. Sneath, P., Sokal, R.: Numerical Taxonomy: the Principles and Practice of Numerical Classification. Freeman, San Francisco (1973)
43. Staunton, J.E., Slonim, D.K., Coller, H.A., Tamayo, P., Angelo, M.J., Park, J., Scherf, U., Lee, J.K., Reinhold, W.O., Weinstein, J.N., Mesirov, J.P., Lander, E.S., Golub, T.R.: Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 98, 10787–10792 (2001)
44. Toler, C.R., Taylor, D.D., Gercel-Taylor, C.: Loss of communication in ovarian cancer. Am. J. Obstet. Gynecol. 194(5), 27–31 (2006)
45. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained K-Means clustering with background knowledge. In: Brodley, C.E., Danyluk, A.P. (eds.) Proceedings of the Eighteenth International Conference on Machine Learning, 28 June–1 July 2001, Williamstown, Massachusetts, pp. 577–584 (2001)
46. Yap, L.F., Jenei, V., Robinson, C.M., Moutasim, K., Benn, T.M., Threadgold, S.P., Lopes, V., Wei, W., Thomas, G.J., Paterson, I.C.: Upregulation of Eps8 in oral squamous cell carcinoma promotes cell migration and invasion through integrin-dependent Rac1 activation. Oncogene 28, 2524–2534 (2009). doi:10.1038/onc.2009.105
47. Yap, L.F., Jenei, V., Robinson, C.M., Moutasim, K., Benn, T.M., Threadgold, S.P., Lopes, V., Wei, W., Thomas, G.J., Paterson, I.C.: Upregulation of Eps8 in oral squamous cell carcinoma promotes cell migration and invasion through integrin-dependent Rac1 activation. Oncogene 28(27), 2524–2534 (2009)