ACADEMIA SINICA

Institute of Information Science

Protein Subcellular Localization Prediction Based on Machine Learning Approaches

Prof. Wen-Lian Hsu (許聞廉), Emily Chia-Yu Su (蘇家玉)
Aug 5, 2008
Data Mining course, National Taiwan Normal University (台師大)

Outline

- Introduction
  - Data mining
  - Protein subcellular localization prediction
- A support vector machine model-based method
  - Support vector machines
  - Compartment-specific biological features
- A probabilistic latent semantic analysis-based method
  - Gapped-dipeptides
  - Probabilistic latent semantic analysis
- Applications of protein localization prediction
- Conclusion

About Myself

- Education
  - Ph.D. Candidate, Bioinformatics Program, Taiwan International Graduate Program (TIGP), Academia Sinica
  - M.S., Department of Computer Science and Information Engineering, National Taiwan University
  - B.S., Department of Information and Computer Education, National Taiwan Normal University
- Research interests
  - Bioinformatics, computational biology, machine learning, data mining, text mining, natural language processing, information retrieval/extraction

About TIGP

- TIGP programs
  - Chemical Biology and Molecular Biophysics (CBMB, 2002)
  - Molecular Science and Technology (MST, 2002)
  - Molecular and Biological Agricultural Sciences (MBAS, 2003)
  - Bioinformatics (Bio, 2003)
  - Molecular and Cell Biology (MCB, 2003)
  - Nano Science and Technology (Nano, 2003)
  - Molecular Medicine (MM, 2004)
  - Computational Linguistics and Chinese Language Processing (CLCLP, 2005)
  - Earth System Science (ESS, will admit students in 2009)
- Website: http://tigp.sinica.edu.tw/

About Bioinformatics Program (1/2)

- Wen-Lian Hsu (許聞廉):
  - Natural language processing, literature mining, proteomics, protein structure prediction
- Der-Tsai Lee (李德財):
  - Computational geometry, parallel and distributed computing, web-based computing, digital libraries, bioinformatics
- Wen-Hsiung Li (李文雄):
  - Molecular evolution, comparative genomics, population genetics, evolution of gene regulation, computational biology
- Wen-Chang Lin (林文昌):
  - Bioinformatics, tumor biology, cancer metastasis

About Bioinformatics Program (2/2)

- Jenn-Kang Hwang (黃鎮剛):
  - Structure prediction and classification, protein stability, structural alignment, molecular simulation
- Cathy S.J. Fann (范盛娟):
  - Biostatistics, genetic epidemiology, genetic statistics, disease gene mapping, population genetics
- Grace S. Shieh (謝叔蓉):
  - Biostatistics, microarray analysis, gene regulatory network prediction, protein interaction networks, comparative genomics
- Ueng-Cheng Yang (楊永正):
  - Bioinformatics, infobiology, RNA-structure analysis, RNA-protein interaction, comparative genomics, genome annotation

Origins of Data Mining

- Goals in data mining
  - Draws ideas from machine learning/pattern recognition, statistics/AI, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data

[Figure: Data Mining shown in relation to Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]

Excerpted from Prof. 柯佳伶, "Introduction to Data Mining" (資料探勘簡介), 2008/7/23

Two Types of Data Mining Methods

- Prediction methods
  - Use some variables to predict unknown or future values of other variables
- Description methods
  - Find human-interpretable patterns that describe the data

Excerpted from Prof. 柯佳伶, "Introduction to Data Mining" (資料探勘簡介), 2008/7/23

Different Tasks in Data Mining

- Classification [Predictive]
- Clustering [Descriptive]
- Association rule discovery [Descriptive]
- Sequential pattern discovery [Descriptive]
- Regression [Predictive]
- Deviation detection [Predictive]

Excerpted from Prof. 柯佳伶, "Introduction to Data Mining" (資料探勘簡介), 2008/7/23

Definition of Classification

- Definition:
  - Given a collection of records (i.e., a training set)
    - Each record contains a set of attributes
  - Find a model for the class attribute as a function of the values of the other attributes
- Goal:
  - Assign a class to previously unseen records as accurately as possible
  - A test set is used to determine the accuracy of the model

Excerpted from Prof. 柯佳伶, "Introduction to Data Mining" (資料探勘簡介), 2008/7/23
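The definition above can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbor rule as the model, which the slides do not prescribe, and the records, attribute values, and class labels "A" and "B" are hypothetical toy data. The training set builds the model, and a separate test set estimates its accuracy on unseen records.

```python
import math

def nearest_neighbor_predict(training_set, attributes):
    """Return the class of the training record closest to `attributes`.

    Assumption: the "model" is a 1-nearest-neighbor rule with Euclidean
    distance; any other classifier could take its place.
    """
    closest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return closest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(
        1
        for attributes, label in test_set
        if nearest_neighbor_predict(training_set, attributes) == label
    )
    return correct / len(test_set)

# Hypothetical toy records: (attributes, class attribute).
training_set = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

# Previously unseen records, held out to measure accuracy.
test_set = [
    ((1.2, 1.4), "A"),
    ((5.5, 5.0), "B"),
    ((2.0, 2.0), "A"),
]

test_accuracy = accuracy(training_set, test_set)
print(f"accuracy on the test set: {test_accuracy:.2f}")
```

Keeping the test records out of the training set is what makes the accuracy estimate meaningful for unseen data.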

Applications in Classification

- Examples of classification:
  - Direct marketing
    - Predict whether a consumer is likely to buy a new cell phone
  - Fraud detection
    - Predict fraudulent cases in credit card transactions
  - Customer attrition/churn
    - Predict whether a customer is likely to be lost to a competitor
  - Sky survey cataloging
    - Predict the class (star or galaxy) of sky objects based on telescopic survey images
- More applications in biology?

Excerpted from Prof. 柯佳伶, "Introduction to Data Mining" (資料探勘簡介), 2008/7/23

Protein Subcellular Localization (PSL) Prediction

- Predict where a protein is located in a cell
  - C1: cytoplasm
  - C2: inner membrane
  - C3: periplasm
  - C4: outer membrane
  - C5: extracellular space

[Figure: localization sites in Gram-negative bacteria]

Gardy LJ, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.

Importance of PSL Prediction

- Protein function identification
  - Modulate and identify protein functions
- Genome annotation
  - Annotate genomic features
- Drug discovery
  - Give clues to new drug targets

The Available Computational Methods for Bacterial PSL Prediction

Gardy LJ, et al. Methods for Predicting Bacterial Protein Subcellular Localization. Nature Reviews Microbiology, 2006.

Classification Formulation

- Given
  - an input space $\Re$
  - a set of classes $\omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$
- the classification problem is
  - to define a mapping $f: \Re \to \omega$ where each $x$ in $\Re$ is assigned to one class
- This mapping function is called a decision function

Decision Function (1/2)

- The basic problem in classification is to find $c$ decision functions

  $d_1(x), d_2(x), \ldots, d_c(x)$

  with the property that, if a pattern $x$ belongs to class $i$, then

  $d_i(x) > d_j(x), \quad i, j = 1, 2, \ldots, c; \; j \neq i$

- $d_i(x)$ is a similarity measure between $x$ and class $i$, such as a distance or a probability
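As a rough sketch of the decision-function idea, the example below assigns a pattern to the class whose decision function is largest. It assumes one particular similarity measure, the negative squared Euclidean distance to a class prototype, and the three prototypes are hypothetical values chosen only for illustration.

```python
def make_decision_functions(prototypes):
    """Build one decision function d_i per class from a class prototype.

    Assumption: d_i(x) = -(squared Euclidean distance from x to prototype i),
    so a larger value means x is more similar to class i.
    """
    def make(prototype):
        return lambda x: -sum((a - b) ** 2 for a, b in zip(x, prototype))
    return {label: make(p) for label, p in prototypes.items()}

def classify(decision_functions, x):
    """Assign x to the class i whose d_i(x) exceeds every other d_j(x)."""
    return max(decision_functions, key=lambda label: decision_functions[label](x))

# Hypothetical prototypes for c = 3 classes in a 2-D input space.
prototypes = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (2.0, 3.0)}
d = make_decision_functions(prototypes)

x = (3.5, 0.5)
scores = {label: d_i(x) for label, d_i in d.items()}
predicted = classify(d, x)
print(scores, "->", "class", predicted)
```

With a probability-based measure, only the body of `make` would change; the comparison of the c values stays the same.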

Decision Function (2/2)

- Example

[Figure: decision function example; surviving labels: d1 = d3, Class 1, d2, d3]
