ACADEMIA SINICA
Institute of Information Science
Protein Subcellular Localization Prediction Based on Machine Learning Approaches
Prof. Wen-Lian Hsu (許聞廉), Emily Chia-Yu Su (蘇家玉)
Aug 5, 2008, National Taiwan Normal University, Data Mining course
Outline
Introduction
  Data mining
  Protein subcellular localization prediction
A support vector machine model-based method
  Support vector machines
  Compartment-specific biological features
A probabilistic latent semantic analysis-based method
  Gapped-dipeptides
  Probabilistic latent semantic analysis
Applications of protein localization prediction
Conclusion
About Myself
Education
  Ph.D. Candidate, Bioinformatics Program, Taiwan International Graduate Program (TIGP), Academia Sinica
  M.S., Department of Computer Science and Information Engineering, National Taiwan University
  B.S., Department of Information and Computer Education, National Taiwan Normal University
Research interests
Bioinformatics, computational biology, machine learning, data mining, text mining, natural language processing, information retrieval/extraction
About TIGP
TIGP programs
  Chemical Biology and Molecular Biophysics (CBMB, 2002)
  Molecular Science and Technology (MST, 2002)
  Molecular and Biological Agricultural Sciences (MBAS, 2003)
  Bioinformatics (Bio, 2003)
  Molecular and Cell Biology (MCB, 2003)
  Nano Science and Technology (Nano, 2003)
  Molecular Medicine (MM, 2004)
  Computational Linguistics and Chinese Language Processing (CLCLP, 2005)
  Earth System Science (ESS, will admit students in 2009)
Website: http://tigp.sinica.edu.tw/
About Bioinformatics Program (1/2)
Wen-Lian Hsu (許聞廉):
  Natural language processing, literature mining, proteomics, protein structure prediction
Der-Tsai Lee (李德財):
  Computational geometry, parallel and distributed computing, web-based computing, digital libraries, bioinformatics
Wen-Hsiung Li (李文雄):
  Molecular evolution, comparative genomics, population genetics, evolution of gene regulation, computational biology
Wen-Chang Lin (林文昌):
  Bioinformatics, tumor biology, cancer metastasis
About Bioinformatics Program (2/2)
Jenn-Kang Hwang (黃鎮剛):
  Structure prediction and classification, protein stability, structural alignment, molecular simulation
Cathy S.J. Fann (范盛娟):
  Biostatistics, genetic epidemiology, genetic statistics, disease gene mapping, population genetics
Grace S. Shieh (謝叔蓉):
  Biostatistics, microarray analysis, gene regulatory network prediction, protein interaction networks, comparative genomics
Ueng-Cheng Yang (楊永正):
  Bioinformatics, infobiology, RNA-structure analysis, RNA-protein interaction, comparative genomics, genome annotation
Origins of Data Mining
Goals in data mining
Draws ideas from machine learning/pattern recognition, statistics/AI, and database systems
Traditional techniques may be unsuitable due to
  Enormity of data
  High dimensionality of data
  Heterogeneous, distributed nature of data
[Figure: diagram placing data mining at the overlap of statistics/AI, machine learning/pattern recognition, and database systems]
Excerpted from Prof. Jia-Ling Koh (柯佳伶), "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
Two Types of Data Mining Methods
Prediction methods
Use some variables to predict unknown or future values of other variables
Description methods
Find human-interpretable patterns that describe the data
Excerpted from Prof. Jia-Ling Koh (柯佳伶), "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
Different Tasks in Data Mining
Classification [Predictive]
Clustering [Descriptive]
Association rule discovery [Descriptive]
Sequential pattern discovery [Descriptive]
Regression [Predictive]
Deviation detection [Predictive]
Excerpted from Prof. Jia-Ling Koh (柯佳伶), "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
Definition of Classification
Definition:
Given a collection of records (i.e., training set)
Each record contains a set of attributes
Find a model for class attribute as a function of the values of other attributes
Goal: Assign a class to previously unseen records as accurately as possible
A test set is used to determine the accuracy of the model
Excerpted from Prof. Jia-Ling Koh (柯佳伶), "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
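The definition above can be illustrated with a minimal sketch. It is not taken from the slides: the toy records, the class labels "A" and "B", and the choice of a 1-nearest-neighbor rule as the model are all assumptions made only to show how a training set builds a model and how a separate test set measures its accuracy.

```python
import math

def nearest_neighbor_predict(training_records, attributes):
    """Return the class label of the training record closest to `attributes`."""
    closest = min(training_records, key=lambda rec: math.dist(rec[0], attributes))
    return closest[1]

def accuracy(training_records, test_records):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(
        1
        for attributes, label in test_records
        if nearest_neighbor_predict(training_records, attributes) == label
    )
    return correct / len(test_records)

# Hypothetical toy data: each record is (attributes, class label).
training_set = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.2), "B"),
    ((3.8, 4.0), "B"),
]
test_set = [
    ((0.9, 1.1), "A"),
    ((4.1, 3.9), "B"),
    ((1.5, 1.5), "A"),
    ((3.0, 3.0), "A"),  # lies nearer the "B" records, so it is misclassified
]

test_accuracy = accuracy(training_set, test_set)
print(test_accuracy)  # 0.75
```

The test records are never used to build the model, which is what makes the reported accuracy an estimate of performance on unseen records.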
Applications in Classification
Examples of classification:
Direct marketing
  Predict whether a consumer is likely to buy a new cell-phone
Fraud detection
  Predict fraudulent cases in credit card transactions
Customer attrition/churn
  Predict whether a customer is likely to be lost to a competitor
Sky survey cataloging
  Predict class (star or galaxy) of sky objects based on the telescopic survey images
More applications in biology?
Excerpted from Prof. Jia-Ling Koh (柯佳伶), "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
Protein Subcellular Localization (PSL) Prediction
Goal: predict where a protein is located in a cell
Localization sites in Gram-negative bacteria:
  C1: cytoplasm
  C2: inner membrane
  C3: periplasm
  C4: outer membrane
  C5: extracellular space
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Importance of PSL Prediction
Protein function identification
  Modulate and identify protein functions
Genome annotation
  Annotate genomic features
Drug discovery
  Give clues to new drug targets
The Available Computational Methods for Bacterial PSL Prediction
Gardy JL, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.
Classification Formulation
Given
  an input space $\Re$
  a set of classes $\omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$
the classification problem is to define a mapping $f: \Re \to \omega$ where each $x$ in $\Re$ is assigned to one class
This mapping function is called a decision function
Decision Function (1/2)
The basic problem in classification is to find $c$ decision functions
$d_1(x), d_2(x), \ldots, d_c(x)$ with the property that, if a pattern $x$ belongs to class $i$, then $d_i(x) > d_j(x)$, $i, j = 1, 2, \ldots, c$; $j \neq i$
$d_i(x)$ is a similarity measure between $x$ and class $i$, such as a distance or a probability
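As a rough sketch of this rule, the snippet below assigns a pattern to the class whose decision function is largest. The three class prototypes and the use of negative Euclidean distance as the similarity measure d_i(x) are illustrative assumptions, not something specified in the slides.

```python
import math

# Hypothetical class prototypes in a 2-D input space, keyed by class index i.
prototypes = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (2.0, 3.0)}

def decision_value(x, class_id):
    """d_i(x): similarity between x and class i (here, negative Euclidean distance)."""
    return -math.dist(x, prototypes[class_id])

def classify(x):
    """Assign x to the class i for which d_i(x) > d_j(x) for every j != i."""
    return max(prototypes, key=lambda class_id: decision_value(x, class_id))

print(classify((0.5, 0.2)))  # 1
print(classify((3.5, 0.5)))  # 2
print(classify((2.0, 2.5)))  # 3
```

Any other similarity measure, such as a class posterior probability, could replace `decision_value` without changing `classify`.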
Decision Function (2/2)
Example
d1=d3
Class 1 d2,d3