
Rough-Fuzzy Relational Clustering Algorithm for Biological Sequence Mining

Pradipta Maji and Sankar K. Pal
Center for Soft Computing Research
Machine Intelligence Unit, Indian Statistical Institute, India
E-mail: {pmaji,sankar}@isical.ac.in

Abstract. This paper presents a hybrid relational clustering algorithm, termed rough-fuzzy c-medoids, to cluster biological sequences. It judiciously integrates the principles of rough sets, fuzzy sets, the c-medoids algorithm, and the amino acid mutation matrix used in biology. The concept of a crisp lower bound and a fuzzy boundary of a class, introduced in rough-fuzzy c-medoids, enables efficient selection of cluster prototypes. The effectiveness of the algorithm, along with a comparison with other algorithms, is demonstrated on different protein data sets.

1 Introduction

Cluster analysis is a technique for finding natural groups present in the data. It divides a given data set into a set of clusters in such a way that two objects from the same cluster are as similar as possible and objects from different clusters are as dissimilar as possible. For biological sequences, the only available information is the numerical values that represent the degrees to which pairs of sequences in the data set are related. Algorithms that generate partitions of this type of relational data are usually referred to as relational or pair-wise clustering algorithms. A well-known relational clustering algorithm is c-medoids, due to Kaufman and Rousseeuw [1].

One of the main problems with biological sequences is uncertainty. Sources of this uncertainty include incompleteness and vagueness in the class definitions of biological data. Against this background, fuzzy set theory [2] and rough set theory [3] have gained popularity for modeling and propagating uncertainty. Both provide a mathematical framework to capture the uncertainties associated with the data [3]. A recent fuzzy relational clustering algorithm is Krishnapuram's fuzzy c-medoids [4]. It can deal with data that belong to more than one cluster at the same time, and it can handle the uncertainties arising from overlapping cluster boundaries. However, it is very sensitive to noise and outliers. The possibilistic c-medoids [4] is an extension of fuzzy c-medoids that efficiently handles data sets containing noise and outliers, but it sometimes generates coincident clusters.

In this paper, we present a relational clustering algorithm, termed rough-fuzzy c-medoids, based on rough sets and fuzzy sets to cluster biological sequences. While the membership function of fuzzy sets enables efficient handling of overlapping partitions, the concept of lower and upper approximations of rough sets deals with uncertainty, vagueness, and incompleteness in class definition. Each partition is represented by a medoid, a crisp lower approximation, and a fuzzy boundary. The medoid depends on the weighted average of the crisp lower approximation and the fuzzy boundary. The similarity between two sequences is computed with reference to a biological similarity matrix (amino acid mutation matrix). In effect, the biological content in the sequences can be maximally utilized for accurate clustering. Some quantitative measures are used to evaluate the quality of the relational clustering algorithm. The effectiveness of the algorithm, along with a comparison with hard c-medoids [1] and fuzzy c-medoids [4], is demonstrated on different protein data sets.

2 Rough-Fuzzy C-Medoids Algorithm

In this section, we first describe hard c-medoids [1] and fuzzy c-medoids [4] for clustering biological sequences. Next, we describe a novel relational clustering algorithm, termed rough-fuzzy c-medoids.

2.1 Hard C-Medoids and Fuzzy C-Medoids

The hard c-medoids algorithm [1] uses the most centrally located object in a cluster, which is termed the medoid. A medoid is essentially an existing data object from the cluster that is closest to the mean of the cluster. Let $A$ be the set of 20 amino acids, $X = \{x_1, \cdots, x_j, \cdots, x_n\}$ be the set of $n$ sequences with $m$ residues, and $V = \{v_1, \cdots, v_i, \cdots, v_c\} \subset X$ be the set of $c$ medoids such that $v_{ik}, x_{jk} \in A$, $\forall i = 1, \ldots, c$, $\forall j = 1, \ldots, n$, $\forall k = 1, \ldots, m$. The non-gapped pair-wise homology alignment score is used as the similarity between two sequences; it can be calculated using an amino acid mutation matrix [5]. The pair-wise alignment score between $x_j$ and $v_i$ is then defined as

$$S(x_j, v_i) = \sum_{k=1}^{m} M(x_{jk}, v_{ik}) \qquad (1)$$

where $M(x_{jk}, v_{ik})$ can be obtained from an amino acid mutation matrix through a table look-up method. The score is high if two sequences are similar or close to each other, and low if two sequences are distinct.

The objective of the hard c-medoids algorithm for clustering biological sequences is to assign $n$ sequences to $c$ clusters. Each of the clusters $\beta_i$ is represented by a medoid $v_i$ for that cluster. The process begins by randomly choosing $c$ sequences as the medoids. Each sequence is assigned to one of the $c$ clusters based on the maximum value of the non-gapped pair-wise homology alignment score $S(x_j, v_i)$ between the sequence $x_j$ and the medoid $v_i$. After the assignment of all the sequences to the various clusters, the new medoids are calculated as follows:

$$v_i = x_q, \quad \text{where } q = \arg\max \{S(x_k, x_j)\};\ x_j \in \beta_i;\ x_k \in \beta_i \qquad (2)$$

and $S(x_k, x_j)$ can be calculated as per (1).

The fuzzy c-medoids algorithm provides a fuzzification of the hard c-medoids algorithm [4]. For relational clustering of biological sequences, it maximizes

$$J = \sum_{j=1}^{n} \sum_{i=1}^{c} (\mu_{ij})^{\acute{m}} \{S(x_j, v_i)\} \qquad (3)$$

where $1 \le \acute{m} < \infty$ is the fuzzifier and $\mu_{ij} \in [0, 1]$ is the fuzzy membership of the sequence $x_j$ in cluster $\beta_i$, such that

$$\mu_{ij} = \frac{S(x_j, v_i)^{\frac{1}{\acute{m}-1}}}{\sum_{l=1}^{c} S(x_j, v_l)^{\frac{1}{\acute{m}-1}}}$$

subject to $\sum_{i=1}^{c} \mu_{ij} = 1, \forall j$, $0$ [...]
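As an illustration of the quantities used by hard and fuzzy c-medoids, the following minimal Python sketch computes the non-gapped alignment score of equation (1), a hard assignment, a medoid update, and fuzzy memberships. It rests on several assumptions that are not part of the paper: `TOY_MATRIX` is a made-up three-letter scoring table standing in for a real 20 x 20 amino acid mutation matrix, the medoid update reads the arg max in (2) as the total score of a candidate against all members of its cluster, and the memberships are taken as scores raised to the power 1/(ḿ − 1) and normalised so that they sum to one over the clusters. All scores are assumed to be positive.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy scores over a three-letter alphabet (illustration only);
# a real run would use a 20 x 20 amino acid mutation matrix.
TOY_MATRIX: Dict[Tuple[str, str], float] = {
    ("A", "A"): 4.0, ("C", "C"): 9.0, ("G", "G"): 6.0,
    ("A", "C"): 1.0, ("A", "G"): 2.0, ("C", "G"): 1.0,
}


def lookup(a: str, b: str) -> float:
    """Table look-up M(a, b); the toy matrix is treated as symmetric."""
    return TOY_MATRIX[(a, b)] if (a, b) in TOY_MATRIX else TOY_MATRIX[(b, a)]


def score(x: str, v: str) -> float:
    """Non-gapped pair-wise alignment score S(x, v) of equation (1)."""
    if len(x) != len(v):
        raise ValueError("non-gapped scoring needs equal-length sequences")
    return sum(lookup(a, b) for a, b in zip(x, v))


def assign_hard(seqs: Sequence[str], medoids: Sequence[str]) -> List[int]:
    """Index of the best-scoring medoid for every sequence."""
    return [max(range(len(medoids)), key=lambda i: score(x, medoids[i]))
            for x in seqs]


def update_medoid(cluster: Sequence[str]) -> str:
    """Assumed reading of (2): the member with the largest total score
    against all members of its own cluster becomes the new medoid."""
    return max(cluster, key=lambda xj: sum(score(xk, xj) for xk in cluster))


def fuzzy_memberships(x: str, medoids: Sequence[str],
                      fuzzifier: float = 2.0) -> List[float]:
    """Memberships of x in each cluster, normalised to sum to one.
    Assumes positive scores and fuzzifier > 1."""
    if fuzzifier <= 1.0:
        raise ValueError("fuzzifier must exceed 1 for this formula")
    power = 1.0 / (fuzzifier - 1.0)
    powered = [score(x, v) ** power for v in medoids]
    total = sum(powered)
    return [s / total for s in powered]
```

With a fuzzifier of 2.0 the exponent is 1, so each membership is simply the score against one medoid divided by the sum of scores against all medoids.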
$\delta$, then $x_j$ lies in the lower approximation $A(\beta_i)$, and hence also in the upper approximation of $\beta_i$, and $x_j \notin A(\beta_k)$; otherwise $x_j \in B(\beta_i)$ and $x_j \in B(\beta_k)$. That is, the algorithm first separates the "core" and overlapping portions of each cluster $\beta_i$ based on the threshold value $\delta$. The "core" portion of the cluster $\beta_i$ is represented by its lower approximation $A(\beta_i)$, while the boundary region $B(\beta_i)$ represents the overlapping portion. In effect, it minimizes the vagueness and incompleteness in cluster definition.

According to the definitions of lower approximation and boundary of rough sets, if a sequence $x_j \in A(\beta_i)$, then $x_j \notin A(\beta_k), \forall k \neq i$, and $x_j \notin B(\beta_i), \forall i$. That is, the sequence $x_j$ is definitely contained in $\beta_i$. Thus, the weights of the sequences in the lower approximation of a cluster should be independent of other medoids and clusters, and should not be coupled with their similarity to other medoids. Also, the sequences in the lower approximation of a cluster should have similar influence on the corresponding medoid and cluster. In contrast, if $x_j \in B(\beta_i)$, then the sequence $x_j$ possibly belongs to $\beta_i$ and potentially belongs to another cluster. Hence, the sequences in boundary regions should have different influence on the medoids and clusters.

So, in rough-fuzzy c-medoids, after assigning each sequence to the lower approximations and boundary regions of different clusters based on $\delta$, the memberships $\mu_{ij}$ of the sequences are modified. The membership values of the sequences in a lower approximation are set to 1, while those in boundary regions remain unchanged. In other words, the proposed c-medoids first partitions the data into two classes: lower approximation and boundary. The concept of fuzzy memberships is applied only to the sequences of the boundary region, which enables the algorithm to handle overlapping clusters. Thus, in rough-fuzzy c-medoids, each cluster is represented by a medoid, a crisp lower approximation, and a fuzzy boundary. The lower approximation influences the fuzziness of the final partition. Rough-fuzzy c-medoids reduces to fuzzy c-medoids when $A(\beta_i) = \emptyset, \forall i$. Thus, the proposed algorithm is a generalization of the existing fuzzy c-medoids.

The new medoids are calculated based on the weighted average of the crisp lower approximation and the fuzzy boundary. The medoid calculation is given by

$$v_i = x_q, \quad \text{where } q = \arg\max \begin{cases} w \times \mathcal{A} + \tilde{w} \times \mathcal{B} & \text{if } A(\beta_i) \neq \emptyset,\ B(\beta_i) \neq \emptyset \\ \mathcal{A} & \text{if } A(\beta_i) \neq \emptyset,\ B(\beta_i) = \emptyset \\ \mathcal{B} & \text{if } A(\beta_i) = \emptyset,\ B(\beta_i) \neq \emptyset \end{cases} \qquad (6)$$

$$\mathcal{A} = \sum_{x_k \in A(\beta_i)} S(x_k, x_j); \qquad \mathcal{B} = \sum_{x_k \in B(\beta_i)} (\mu_{ik})^{\acute{m}} S(x_k, x_j)$$

The parameters $w$ and $\tilde{w}$ ($= 1 - w$) correspond to the relative importance of the lower approximation and the boundary region. Since the sequences lying in the lower approximation definitely belong to a cluster, they are assigned a higher weight $w$ compared to $\tilde{w}$ of the sequences lying in the boundary region. That is, $0 < \tilde{w} < w < 1$.
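The partition into lower approximation and boundary, followed by the weighted medoid update of (6), can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the similarity matrix `S`, the membership matrix `mu`, and the function names are hypothetical; the threshold rule is assumed to compare the difference between the highest and second highest membership of a sequence with δ; and the arg max is assumed to run over the sequences in the lower approximation or boundary of the cluster. At least two clusters are required.

```python
from typing import List, Set, Tuple


def split_lower_boundary(mu: List[List[float]],
                         delta: float) -> Tuple[List[Set[int]], List[Set[int]]]:
    """mu[i][j] is the membership of sequence j in cluster i.
    Assumed rule: if the gap between the two largest memberships of
    sequence j exceeds delta, j joins the lower approximation of its best
    cluster; otherwise it joins the boundary of its two best clusters."""
    c, n = len(mu), len(mu[0])
    lower: List[Set[int]] = [set() for _ in range(c)]
    boundary: List[Set[int]] = [set() for _ in range(c)]
    for j in range(n):
        ranked = sorted(range(c), key=lambda i: mu[i][j], reverse=True)
        best, second = ranked[0], ranked[1]
        if mu[best][j] - mu[second][j] > delta:
            lower[best].add(j)
        else:
            boundary[best].add(j)
            boundary[second].add(j)
    return lower, boundary


def update_medoid(i: int, S: List[List[float]], mu: List[List[float]],
                  lower: List[Set[int]], boundary: List[Set[int]],
                  w: float = 0.7, fuzzifier: float = 2.0) -> int:
    """Index q of the new medoid of cluster i, following equation (6)."""
    candidates = sorted(lower[i] | boundary[i])
    if not candidates:
        raise ValueError("cluster has no sequences")

    def objective(j: int) -> float:
        # Crisp contribution of the lower approximation.
        a = sum(S[k][j] for k in lower[i])
        # Fuzzy contribution of the boundary region.
        b = sum((mu[i][k] ** fuzzifier) * S[k][j] for k in boundary[i])
        if lower[i] and boundary[i]:
            return w * a + (1.0 - w) * b
        return a if lower[i] else b

    return max(candidates, key=objective)
```

Because sequences in a lower approximation enter the objective with unit weight, their memberships never appear in the crisp sum, which mirrors setting those memberships to 1.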

3 Quantitative Measure

In this section we present some quantitative indices to evaluate the quality of relational clustering for biological sequences.

β Index: It is defined as

$$\beta = \frac{1}{c} \sum_{i=1}^{c} \frac{1}{n_i} \sum_{x_j \in \beta_i} \frac{S(x_j, v_i)}{S(v_i, v_i)} \qquad (7)$$

where $n_i$ is the number of sequences in the $i$th cluster $\beta_i$ and $S(x_j, v_i)$ is the non-gapped pair-wise homology alignment score between sequence $x_j$ and medoid $v_i$. The β index is the average normalized homology alignment score of the input sequences with respect to their corresponding medoids. It increases as the homology alignment scores within a cluster increase. The value of β also increases with $c$. In the extreme case when the number of clusters is maximum, i.e., $c = n$, the total number of sequences, we have $\beta = 1$. Thus, $0 < \beta \le 1$.

γ Index: It can be defined as

$$\gamma = \max_{i,j} \frac{1}{2} \left\{ \frac{S(v_i, v_j)}{S(v_i, v_i)} + \frac{S(v_j, v_i)}{S(v_j, v_j)} \right\} \qquad (8)$$

with $0 < \gamma < 1$. The γ index calculates the maximum normalized homology alignment score between medoids. A good clustering procedure for medoid selection should make the homology alignment score between all medoids as low as possible. Minimizing the γ index therefore minimizes the between-cluster homology alignment score.

Based on mutual information, the β index, written $\bar{\beta}$, is as follows:

$$\bar{\beta} = \frac{1}{c} \sum_{i=1}^{c} \frac{1}{n_i} \sum_{x_j \in \beta_i} \frac{MI(x_j, v_i)}{MI(v_i, v_i)}; \qquad MI(x_i, x_j) = H(x_i) + H(x_j) - H(x_i, x_j) \qquad (9)$$

$MI(x_i, x_j)$ is the mutual information between sequences $x_i$ and $x_j$, with $H(x_i)$ and $H(x_j)$ being the entropies of sequences $x_i$ and $x_j$ respectively, and $H(x_i, x_j)$ their joint entropy. $H(x_i)$ and $H(x_i, x_j)$ are defined as

$$H(x_i) = -p(x_i) \ln p(x_i); \qquad H(x_i, x_j) = -p(x_i, x_j) \ln p(x_i, x_j) \qquad (10)$$

where $p(x_i)$ and $p(x_i, x_j)$ are the a priori probability of $x_i$ and the joint probability of $x_i$ and $x_j$ respectively. Similarly, the γ index based on mutual information, written $\bar{\gamma}$, is

$$\bar{\gamma} = \max_{i,j} \frac{1}{2} \left\{ \frac{MI(v_i, v_j)}{MI(v_i, v_i)} + \frac{MI(v_i, v_j)}{MI(v_j, v_j)} \right\} \qquad (11)$$
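The homology-based β and γ indices of (7) and (8) can be computed directly from a pair-wise score matrix. The sketch below is illustrative only and makes the following assumptions: `S` is a hypothetical precomputed matrix of alignment scores, `medoids` holds the sequence indices chosen as medoids, `labels[j]` is the cluster index of sequence `j`, every cluster is non-empty, and the maximum in (8) is taken over pairs of distinct medoids (with i = j the expression would trivially equal 1).

```python
from typing import List


def beta_index(S: List[List[float]], medoids: List[int],
               labels: List[int]) -> float:
    """Average normalized score of sequences to their own medoid, eq. (7)."""
    total = 0.0
    for i, v in enumerate(medoids):
        members = [j for j, lab in enumerate(labels) if lab == i]
        # Assumes each cluster has at least one member (its medoid).
        total += sum(S[j][v] / S[v][v] for j in members) / len(members)
    return total / len(medoids)


def gamma_index(S: List[List[float]], medoids: List[int]) -> float:
    """Maximum normalized score between two distinct medoids, eq. (8)."""
    return max(
        0.5 * (S[vi][vj] / S[vi][vi] + S[vj][vi] / S[vj][vj])
        for vi in medoids
        for vj in medoids
        if vi != vj
    )
```

The mutual-information variants of (9) and (11) would follow the same pattern with a matrix of $MI$ values in place of `S`.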

4 Experimental Results

The performance of rough-fuzzy c-medoids (RFCMdd) is compared extensively with that of hard c-medoids (HCMdd) [1] and fuzzy c-medoids (FCMdd) [4]. To analyze the performance of RFCMdd, we use the Cai-Chou HIV data set [6] and caspase cleavage protein sequences downloaded from the NCBI (www.ncbi.nih.gov). The Dayhoff amino acid mutation matrix [5] is used to calculate the non-gapped pair-wise homology score between two sequences.

4.1 Optimum Values of Parameters $\acute{m}$, $w$, and $\delta$

Tables 1-3 report the performance of the different c-medoids algorithms for different values of $\acute{m}$, $w$, and $\delta$, respectively. The results and subsequent discussions are presented here with respect to $\beta$, $\gamma$, and their mutual-information-based counterparts $\bar{\beta}$ and $\bar{\gamma}$.

Table 1. Performance of RFCMdd and FCMdd for Different Values of $\acute{m}$ (HIV: Cai-Chou HIV data set; Caspase: caspase cleavage proteins)

| Value of $\acute{m}$ | Algorithm | HIV $\beta$ | HIV $\gamma$ | HIV $\bar{\beta}$ | HIV $\bar{\gamma}$ | Caspase $\beta$ | Caspase $\gamma$ | Caspase $\bar{\beta}$ | Caspase $\bar{\gamma}$ |
|---|---|---|---|---|---|---|---|---|---|
| 1.7 | RFCMdd | 0.794 | 0.677 | 0.895 | 0.950 | 0.785 | 0.647 | 0.907 | 0.977 |
| 1.7 | FCMdd | 0.750 | 0.728 | 0.868 | 0.973 | 0.772 | 0.671 | 0.883 | 0.978 |
| 1.8 | RFCMdd | 0.818 | 0.639 | 0.907 | 0.932 | 0.803 | 0.628 | 0.923 | 0.972 |
| 1.8 | FCMdd | 0.764 | 0.695 | 0.890 | 0.954 | 0.795 | 0.671 | 0.890 | 0.978 |
| 1.9 | RFCMdd | 0.829 | 0.618 | 0.911 | 0.927 | 0.814 | 0.611 | 0.937 | 0.965 |
| 1.9 | FCMdd | 0.809 | 0.656 | 0.903 | 0.941 | 0.808 | 0.668 | 0.898 | 0.962 |
| 2.0 | RFCMdd | 0.829 | 0.618 | 0.911 | 0.927 | 0.839 | 0.608 | 0.942 | 0.944 |
| 2.0 | FCMdd | 0.809 | 0.656 | 0.903 | 0.941 | 0.816 | 0.662 | 0.901 | 0.953 |
| 2.1 | RFCMdd | 0.811 | 0.622 | 0.908 | 0.945 | 0.826 | 0.617 | 0.935 | 0.949 |
| 2.1 | FCMdd | 0.802 | 0.671 | 0.901 | 0.948 | 0.801 | 0.665 | 0.899 | 0.973 |
| 2.2 | RFCMdd | 0.802 | 0.640 | 0.903 | 0.958 | 0.817 | 0.639 | 0.928 | 0.954 |
| 2.2 | FCMdd | 0.767 | 0.692 | 0.892 | 0.977 | 0.798 | 0.665 | 0.895 | 0.973 |
| 2.3 | RFCMdd | 0.791 | 0.658 | 0.882 | 0.961 | 0.801 | 0.641 | 0.901 | 0.961 |
| 2.3 | FCMdd | 0.760 | 0.703 | 0.877 | 0.982 | 0.784 | 0.668 | 0.886 | 0.979 |

The fuzzifier $\acute{m}$ controls the extent of membership sharing between fuzzy clusters. From Table 1, it is seen that as $\acute{m}$ increases up to 2.0, the values of $\beta$ and $\bar{\beta}$ increase, while $\gamma$ and $\bar{\gamma}$ decrease. RFCMdd and FCMdd achieve their best performance with $\acute{m} = 1.9$ and $2.0$ for the Cai-Chou HIV data set and with $\acute{m} = 2.0$ for the caspase cleavage protein sequences. For $\acute{m} > 2.0$, the performance of both algorithms decreases as $\acute{m}$ increases. That is, the best performance of RFCMdd and FCMdd is achieved when the fuzzy membership value of a sequence in a cluster is equal to its normalized homology alignment score with respect to all the medoids.

Table 2. Performance of RFCMdd for Different Values of $w$ ($= 1 - \tilde{w}$) (HIV: Cai-Chou HIV data set; Caspase: caspase cleavage proteins)

| Value of $w$ | HIV $\beta$ | HIV $\gamma$ | HIV $\bar{\beta}$ | HIV $\bar{\gamma}$ | Caspase $\beta$ | Caspase $\gamma$ | Caspase $\bar{\beta}$ | Caspase $\bar{\gamma}$ |
|---|---|---|---|---|---|---|---|---|
| 0.51 | 0.684 | 0.827 | 0.806 | 1.000 | 0.683 | 0.714 | 0.808 | 1.000 |
| 0.60 | 0.788 | 0.708 | 0.883 | 0.991 | 0.779 | 0.649 | 0.883 | 0.983 |
| 0.70 | 0.829 | 0.618 | 0.911 | 0.927 | 0.839 | 0.608 | 0.942 | 0.944 |
| 0.80 | 0.793 | 0.651 | 0.874 | 0.978 | 0.817 | 0.622 | 0.914 | 0.964 |
| 0.90 | 0.748 | 0.711 | 0.829 | 1.000 | 0.761 | 0.682 | 0.825 | 1.000 |
| 0.99 | 0.671 | 0.813 | 0.802 | 1.000 | 0.675 | 0.762 | 0.798 | 1.000 |

The parameter $w$ influences the performance of RFCMdd. Since the sequences lying in the lower approximation definitely belong to a cluster, they are assigned a higher weight $w$ compared to $\tilde{w}$ of the sequences lying in boundary regions. Hence, for RFCMdd, $0 < \tilde{w} < w < 1$. Table 2 presents the performance of RFCMdd for different values of $w$, considering $\acute{m} = 2.0$ and $\delta = 0.20$. When the sequences of both the lower approximation and the boundary region are assigned approximately equal weights, the performance of RFCMdd is significantly poorer than that of HCMdd. As $w$ increases up to 0.70, the values of $\beta$ and $\bar{\beta}$ increase, while $\gamma$ and $\bar{\gamma}$ decrease. The best performance on both data sets is achieved with $w = 0.70$. The performance reduces significantly when $w \approx 1.00$. In this case, since the clusters cannot see the sequences of boundary regions, the mobility of the clusters and the medoids reduces. As a result, some medoids get stuck in local optima. On the other hand, when $w = 0.70$, the sequences of lower approximations are assigned a higher weight than those of boundary regions, while the clusters and the medoids retain a greater degree of freedom to move. In effect, the quality of the generated clusters is better than for other values of $w$.

Table 3. Performance of RFCMdd for Different Values of $\delta$ (HIV: Cai-Chou HIV data set; Caspase: caspase cleavage proteins)

| Value of $\delta$ | HIV $\beta$ | HIV $\gamma$ | HIV $\bar{\beta}$ | HIV $\bar{\gamma}$ | Caspase $\beta$ | Caspase $\gamma$ | Caspase $\bar{\beta}$ | Caspase $\bar{\gamma}$ |
|---|---|---|---|---|---|---|---|---|
| 0.00 | 0.713 | 0.782 | 0.817 | 1.000 | 0.707 | 0.698 | 0.862 | 1.000 |
| 0.05 | 0.753 | 0.707 | 0.868 | 1.000 | 0.766 | 0.683 | 0.881 | 1.000 |
| 0.10 | 0.794 | 0.683 | 0.882 | 0.991 | 0.801 | 0.641 | 0.907 | 0.995 |
| 0.15 | 0.806 | 0.629 | 0.902 | 0.964 | 0.819 | 0.622 | 0.928 | 0.973 |
| 0.20 | 0.829 | 0.618 | 0.911 | 0.927 | 0.839 | 0.608 | 0.942 | 0.944 |
| 0.25 | 0.811 | 0.638 | 0.907 | 0.952 | 0.814 | 0.631 | 0.932 | 0.980 |
| 0.30 | 0.805 | 0.681 | 0.894 | 0.988 | 0.791 | 0.667 | 0.908 | 0.995 |
| 0.35 | 0.784 | 0.704 | 0.875 | 1.000 | 0.772 | 0.671 | 0.881 | 1.000 |

The performance of RFCMdd also depends on the value of $\delta$, which determines the class labels of all the sequences. In other words, RFCMdd partitions the sequences of a cluster into two classes, lower approximation and boundary, based on the value of $\delta$. Table 3 presents the performance of RFCMdd for different values of $\delta$, considering $\acute{m} = 2.0$ and $w = 0.70$. For $\delta = 0.0$, all the sequences will be in the lower approximations of different clusters and $B(\beta_i) = \emptyset, \forall i$. In effect, RFCMdd reduces to conventional HCMdd. On the other hand, for $\delta = 1.0$, $A(\beta_i) = \emptyset, \forall i$ and all the sequences will be in the boundary regions of different clusters. That is, RFCMdd reduces to FCMdd. The best performance of RFCMdd with respect to $\beta$, $\gamma$, and their mutual-information-based counterparts $\bar{\beta}$ and $\bar{\gamma}$ is achieved with $\delta = 0.2$. This is approximately equal to the average difference between the highest and second highest fuzzy membership values of all the sequences.

4.2 Comparative Performance of Different Relational Algorithms

Finally, Table 4 provides the comparative results of the different algorithms. RFCMdd produces medoids having the highest $\beta$ and $\bar{\beta}$ values and the lowest $\gamma$ and $\bar{\gamma}$ values in all cases, where $\bar{\beta}$ and $\bar{\gamma}$ denote the mutual-information-based indices. Table 4 also provides the execution time of the different algorithms for the two data sets.

Table 4. Comparative Performance of Different Methods

| Data Set | Algorithm | $\beta$ | $\gamma$ | $\bar{\beta}$ | $\bar{\gamma}$ | Time (ms) |
|---|---|---|---|---|---|---|
| Cai-Chou HIV | RFCMdd | 0.829 | 0.618 | 0.911 | 0.927 | 6217 |
| Cai-Chou HIV | FCMdd | 0.809 | 0.656 | 0.903 | 0.941 | 4083 |
| Cai-Chou HIV | HCMdd | 0.713 | 0.782 | 0.817 | 1.000 | 718 |
| Caspase Cleavage Protein | RFCMdd | 0.839 | 0.608 | 0.942 | 0.944 | 513704 |
| Caspase Cleavage Protein | FCMdd | 0.816 | 0.662 | 0.901 | 0.953 | 510961 |
| Caspase Cleavage Protein | HCMdd | 0.707 | 0.698 | 0.862 | 1.000 | 8326 |

The execution time required for RFCMdd is comparable to that of FCMdd. For HCMdd, although the execution time is lower, the performance is significantly poorer than that of FCMdd and RFCMdd. Use of rough and fuzzy sets adds computational load to HCMdd (for example, 718 ms for HCMdd versus 4083 ms for FCMdd and 6217 ms for RFCMdd on the Cai-Chou HIV data set); however, the corresponding integrated methods (FCMdd and RFCMdd) show a definite increase in $\beta$ and $\bar{\beta}$ values and a decrease in $\gamma$ and $\bar{\gamma}$ values. Integration of rough sets, fuzzy sets, and c-medoids in the RFCMdd algorithm produces a set of most informative medoids in computation time comparable to that of FCMdd.

5 Conclusion

The main contribution of the paper is a methodology integrating the merits of rough sets, fuzzy sets, the c-medoids algorithm, and the amino acid mutation matrix for clustering biological sequences. Although the methodology has been demonstrated for biological sequence analysis, the concept can be applied to other relational unsupervised classification problems.

Acknowledgement. The authors would like to thank the DST, Govt. of India for funding the CSCR under its IRHPA scheme. The work was done when one of the authors, S.K. Pal, was a J. C. Bose Fellow of the Govt. of India.

References

1. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Brussels, Belgium, 1990.
2. L. A. Zadeh, "Fuzzy Sets," Information and Control, vol. 8, pp. 338–353, 1965.
3. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data. Dordrecht, The Netherlands: Kluwer, 1991.
4. R. Krishnapuram, A. Joshi, O. Nasraoui, and L. Yi, "Low Complexity Fuzzy Relational Clustering Algorithms for Web Mining," IEEE Transactions on Fuzzy Systems, vol. 9, pp. 595–607, 2001.
5. M. S. Johnson and J. P. Overington, "A Structural Basis for Sequence Comparisons: An Evaluation of Scoring Methodologies," Journal of Molecular Biology, vol. 233, pp. 716–738, 1993.
6. Y. D. Cai and K. C. Chou, "Artificial Neural Network Model for Predicting HIV Protease Cleavage Sites in Protein," Advances in Engineering Software, vol. 29, no. 2, pp. 119–128, 1998.
