Identification of Melanoma (Skin Cancer) Proteins through Support Vector Machine

Babita Rathore¹, Sandeep K. Kushwaha¹, and Madhvi Shakya²

¹ Department of Bioinformatics, MANIT, Bhopal 462051, India
² Department of Mathematics, MANIT, Bhopal 462051, India
Abstract. Melanoma is a form of cancer that begins in melanocytes. The occurrence of melanoma continues to rise across the world, and current therapeutic options are of limited benefit. Researchers are studying the genetic changes in skin tissue linked to life-threatening melanoma through SNP genotyping, expression microarrays, RNA interference, etc. Across the spectrum of the disease, the identification and characterization of melanoma proteins is also a very important task. In the present study, an effort has been made to identify melanoma proteins through a Support Vector Machine (SVM). A positive dataset was prepared from databases and the literature, whereas the negative dataset consists of core metabolic proteins. In total, 420 compositional properties, namely amino acid dipeptide and multiplet frequencies, were used to develop the SVM classifier models. The average performance of the models varies from 0.65 to 0.80 in Matthews correlation coefficient, and an accuracy of 91.56% was achieved on a random data set.

Keywords: Skin Cancer, Support Vector Machines, Melanoma, Compositional property.
V.V. Das, R. Vijaykumar et al. (Eds.): ICT 2010, CCIS 101, pp. 571–575, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

After several decades of study, cancer remains an active research area because of its diversity in site, origin and pathophysiological mechanism. Skin cancer arises mainly in the outermost layer of the skin, the epidermis. Epithelial cells produce three cutaneous barriers against environmental factors, and a number of neoplasms originate from cutaneous epithelial cells [1]. Melanoma is one of the best-known types of skin cancer. It is a form of cancer that begins in melanocytes. It may begin in a mole (skin melanoma), but can also begin in other pigmented tissues, such as the eye or the intestines. Melanoma causes the majority (75%) of skin cancer related deaths. The incidence of melanoma has risen rapidly in developed countries during the past several decades [2]. A world statistical analysis of cancer for 2000 reported 47,700 new cases of melanoma and 7,700 deaths [3]. The estimated new cases and deaths from melanoma in the United States in 2007 were 59,940 and 8,110, respectively. The lifetime risk of developing melanoma is now 1 in 49 for men and 1 in 73 for women [4]. For 2009, an estimated 68,720 new cases of melanoma and 8,650 deaths were projected [5]. Melanoma can be classified into four main classes: Superficial spreading melanoma,
Nodular melanoma, Lentigo maligna melanoma and Acral lentiginous melanoma [6]. The major risk factors for developing melanoma are intrinsic (genetic) and environmental (sun exposure) [7]. The initial signs and symptoms of melanoma are diverse and include asymmetry in size or shape, change in colour, itching, tenderness, bleeding and ulceration of the suspicious lesion [8]. At any stage of the disease, people with melanoma may receive treatment to control pain and other symptoms of the cancer and to relieve the side effects of therapy. Medically, melanoma treatment includes surgery, chemotherapy, biological therapy, radiation therapy, or a combination of these treatments [9]. Researchers are studying the genetic basis of melanoma with techniques including SNP genotyping, microarrays and RNA interference [4]. Recently, researchers have focused on identifying melanoma proteins, which contribute to the development of melanoma. The completion of the human genome project and advances in computation have facilitated the identification of various kinds of human proteins. The availability of genomic sequences from genome projects has accelerated the experimental identification and characterization of predicted proteins. Identification and characterization can be simplified through computational tools, algorithms and prediction methods. In the present study, we describe a method for identifying melanoma-causing proteins by employing a Support Vector Machine (SVM). SVMs have demonstrated highly competitive performance in numerous real-world applications, such as bioinformatics, text mining, face recognition and image processing, which has established SVMs as one of the state-of-the-art tools for machine learning and data mining, along with other soft computing techniques such as neural networks and fuzzy systems [10]. Moreover, SVMs have been reported to perform considerably better than artificial neural networks (ANNs) in many classification problems [11].
2 Materials and Methods

2.1 Preparation of Datasets and Redundancy Removal

Melanoma protein sequences were retrieved from the NCBI and SWISS-PROT databases [12]. At the first level of filtering, sequences matching the keyword filters "Probable", "Putative", "Hypothetical", "Unknown" and "Possible" were removed. Non-redundancy of the prepared datasets was verified through BLASTclust, using strict criteria (100% sequence coverage and 100% sequence identity). After the second level of filtering, 76 sequences remained in the positive dataset and 304 sequences in the negative dataset.

2.2 Compositional Properties Used

a.
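Dipeptide frequencies: The frequency of a dipeptide (i, j) is expressed as a percentage of the dipeptides in the sequence:

$$ f_{ij} = \frac{\text{count of dipeptide } (i,j)}{\text{total dipeptide count}} \times 100 $$

where i, j = 1–20. There are 20 × 20 = 400 possible dipeptides.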
b.
Multiplet frequencies: Multiplets are defined as homopolymeric stretches (X)n, where X is an amino acid and n is an integer ≥ 2. After identifying all the multiplets, the frequency of each amino acid i in the multiplets, f_i(m), was computed as follows:

$$ f_i(m) = \frac{\text{count of amino acid } i \text{ occurring in multiplets}}{L} \times 100 $$
where L is the length of the sequence. There are 20 possible values of f_i(m), one for each of the 20 amino acids. For each sequence, the 420 compositional frequencies were used as numerical input features [13].

2.3 Support Vector Machine

The support vector machine (SVM) is a supervised machine learning technique applicable to both classification and regression. In 1995, Vladimir Vapnik and co-workers developed this statistical learning technique at AT&T Bell Laboratories. SVM classifiers for melanoma protein identification were developed with the freely downloadable SVM package SVMlight (http://www.cs.cornell.edu/People/tj/svm_light/). The kernel and its supporting parameters were selected and optimized for best performance through cross-validation [14].

2.4 Performance Evaluation

Performance was evaluated by calculating specificity (SP), sensitivity (SN) and the Matthews correlation coefficient (MCC) [15]:

$$ SN = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP} $$

$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} $$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
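The measures named in Section 2.4 can be computed directly from confusion-matrix counts. The following is a minimal Python sketch, assuming the standard definitions of sensitivity, specificity, accuracy and MCC; the function name and the example counts are illustrative and are not taken from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return SN, SP, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SN": sensitivity, "SP": specificity, "ACC": accuracy, "MCC": mcc}


# Hypothetical counts for one test subset (not results from the paper).
print(classification_metrics(tp=14, tn=58, fp=3, fn=1))
```

MCC is useful here because the negative dataset is four times larger than the positive one, and accuracy alone can look high on such unbalanced data.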
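Returning to the compositional properties of Section 2.2, the 420-feature vector could be computed along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: dipeptides are counted with overlap, the multiplet count for an amino acid is the number of its residues lying in runs of length ≥ 2, both frequencies are scaled to percentages, and residues outside the 20 standard amino acids are ignored. All function names are hypothetical.

```python
from itertools import groupby, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_frequencies(seq):
    """Percentage of each of the 400 dipeptides among overlapping pairs in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for k in range(len(seq) - 1):
        pair = seq[k:k + 2]
        if pair in counts:  # assumption: skip pairs with non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return [100.0 * counts[p] / total if total else 0.0 for p in DIPEPTIDES]


def multiplet_frequencies(seq):
    """Percentage of sequence length occupied by each residue inside runs >= 2."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue, run in groupby(seq):
        run_length = len(list(run))
        if run_length >= 2 and residue in counts:
            counts[residue] += run_length
    length = len(seq)
    return [100.0 * counts[a] / length if length else 0.0 for a in AMINO_ACIDS]


def feature_vector(seq):
    """Concatenate 400 dipeptide and 20 multiplet frequencies (420 features)."""
    return dipeptide_frequencies(seq) + multiplet_frequencies(seq)


print(len(feature_vector("AAKLLLA")))
```

For the toy sequence "AAKLLLA", the runs "AA" and "LLL" are the only multiplets, so only A and L receive non-zero multiplet frequencies.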
3 Results and Discussion

Various models were generated to identify melanoma proteins through the Support Vector Machine. Fourteen SVM classifier models were identified through an exhaustive search over 255 kernel parameter sets. The kernels most commonly used for classification tasks are the polynomial function and the radial basis function (RBF). For the polynomial kernel, all SVM parameters were left at their default values except the degree d and c, the trade-off between training error and margin. The memory parameter (m) was fixed at 200 [16]. The values of d and c were incremented stepwise
through combinations of d and c. For the polynomial kernel, d ranged from 1 to 7 and c from 1e-07 to 1e+07. Similarly, for the RBF kernel, gamma (g) was incremented stepwise in combination with the parameter c [12]; g ranged from 1e-07 to 100 and c from 0.01 to 1e+12. The method was evaluated by five-fold cross-validation: 255 classification models were created from each of the five training datasets. Model performance was assessed through the Matthews correlation coefficient (MCC), and the models performed well, with MCC values of 0.65–0.80 [17]. The best-performing models are shown in Table 1.

Table 1. Parameter sets and performance of the selected models for identifying melanoma proteins
S.N.  Model No.  Kernel type  Parameters              Mean MCC  Accuracy
1     7          Polynomial   d = 1, c = 0.1          0.93      92%
2     20         Polynomial   d = 2, c = 0.001        0.90      91%
3     114        RBF          g = 1e-07, c = 10^6     0.96      94%
4     127        RBF          g = 1e-06, c = 10^4     0.83      86%
5     128        RBF          g = 1e-06, c = 10^5     0.96      94%
6     141        RBF          g = 1e-05, c = 10^3     0.83      88%
7     142        RBF          g = 1e-05, c = 10^4     0.96      95%
8     155        RBF          g = 0.0001, c = 100     0.83      88%
9     156        RBF          g = 0.0001, c = 10^3    0.95      95%
10    169        RBF          g = 0.001, c = 10       0.84      88%
11    170        RBF          g = 0.001, c = 100      0.96      95%
12    183        RBF          g = 0.01, c = 1         0.83      86%
13    184        RBF          g = 0.01, c = 10        0.96      95%
14    198        RBF          g = 0.1, c = 1          0.96      95%

Model No. is the number of the identified model (classifier); Mean MCC is the mean MCC for the parameter set across the five test subsets.
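The exhaustive search over kernel parameters described above can be sketched as a simple enumeration. The step sizes are not stated in the text; the sketch below assumes unit steps for d and decade steps for c and g, with the polynomial sets listed before the RBF sets. Under those assumptions the enumeration yields 255 parameter sets, and the 1-based positions match the model numbers in Table 1. The svm_learn options used (-t, -d, -g, -c, -m) are those of SVMlight; the function names and the file names train.dat and model.dat are placeholders.

```python
def build_parameter_grid():
    """Enumerate (kernel, parameters) pairs in the assumed search order."""
    grid = []
    # Polynomial kernel: d = 1..7, c = 1e-07..1e+07 in decade steps (assumed).
    for d in range(1, 8):
        for c_exp in range(-7, 8):
            grid.append(("polynomial", {"d": d, "c_exp": c_exp}))
    # RBF kernel: g = 1e-07..100, c = 0.01..1e+12 in decade steps (assumed).
    for g_exp in range(-7, 3):
        for c_exp in range(-2, 13):
            grid.append(("rbf", {"g_exp": g_exp, "c_exp": c_exp}))
    return grid


def svm_learn_args(kernel, params, memory_mb=200):
    """Build an svm_learn argument list for one parameter set."""
    if kernel == "polynomial":
        kernel_args = ["-t", "1", "-d", str(params["d"])]
    else:
        kernel_args = ["-t", "2", "-g", f"1e{params['g_exp']}"]
    return ["svm_learn", *kernel_args, "-c", f"1e{params['c_exp']}",
            "-m", str(memory_mb), "train.dat", "model.dat"]


grid = build_parameter_grid()
print(len(grid))
# Model numbers are 1-based positions in the grid, e.g. models 7 and 114.
for model_no in (7, 114):
    print(model_no, " ".join(svm_learn_args(*grid[model_no - 1])))
```

Storing the exponents rather than the floating-point values keeps the grid exact and makes the comparison with the table straightforward.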
4 Conclusion

The occurrence of melanoma continues to rise across the world. The lifetime risk of developing melanoma is higher in men than in women, and the current therapeutic options for metastatic melanoma appear to be of limited benefit. New therapeutic methods are therefore needed, and the identification and characterization of melanoma proteins is an important task in this effort. In this work, a melanoma protein prediction method using a Support Vector Machine is proposed; it distinguishes melanoma proteins from non-melanoma proteins with an accuracy of 91.56%. The fourteen best models identified for melanoma protein prediction are shown in Table 1. Against this background, the identification of several proteins of unknown function as melanoma-like proteins could generate new leads for further characterization.
References

1. Owens, D.M., Watt, F.M.: Contribution of stem cells and differentiated cells to epidermal tumours. Nat. Rev. Cancer 3, 444–451 (2003)
2. Berwick, M., Erdei, E., Hay, J.: Melanoma epidemiology and public health. Dermatol. Clin. 27, 205–214 (2009)
3. Greenlee, R.T., Murray, T., Bolden, S., Wingo, P.A.: Cancer statistics. CA Cancer J. Clin. 50, 7–33 (2000)
4. Sanjiv, S., Agarwala, M.D.: Metastatic melanoma: an AJCC review. Community Oncology 5(8), 441–445 (2008)
5. Jemal, A., Siegel, R., Ward, E., Hao, Y., Xu, J., Thun, M.J.: Cancer statistics. CA Cancer J. Clin. 59, 225–249 (2009)
6. Hartleb, J., Arndt, R.: Cysteine and indole derivatives as markers for malignant melanoma. Journal of Chromatography B 764, 409–443 (2001)
7. Wang, S., Setlow, R., Berwick, M., Polsky, D., Marghoob, A., Kopf, A., Bart, R.: Ultraviolet A and melanoma: a review. J. Am. Acad. Dermatol. 44(5), 837–846 (2001)
8. Elwood, J.M., Gallagher, R.P.: The first signs and symptoms of melanoma: a population-based study. Pigment Cell Res. 9, 118–130 (1988)
9. Oliveria, S.A., Christos, P.J., Halpern, A.C., Fine, J.A., Barnhill, R.L., Berwick, M.: Patient knowledge, awareness, and delay in seeking medical attention for malignant melanoma. Journal of Clinical Epidemiology 52(11), 1111–1116 (1999)
10. Kecman, V.: Support Vector Machines – An Introduction. Stud. Fuzz. 177, 1–47 (2005)
11. http://www.imtech.res.in/raghava/ctlpred/about.html
12. Bairoch, A., Apweiler, R.: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acid Res. 28(1), 45–48 (2002)
13. Ansari, F.A., Naveen, K., Subramanyam, M.B., Gnanamani, M., Ramachandran, S.: MAAP: Malarial adhesins and adhesin-like proteins predictor. Proteins 70, 659–666 (2008)
14. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods: Support Vector Learning, pp. 42–56. MIT Press, Cambridge (1999)
15. Kushwaha, S.K., Shakya, M.: Neural Network: A Machine Learning Technique for Tertiary Structure Prediction of Proteins from Peptide Sequences. In: 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies (ACT), pp. 98–101 (2009)
16. Bhasin, M., Raghava, G.P.S.: Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Sci. 13, 596–607 (2004)
17. Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975)