Trends in Computational Biology and Bioinformatics in the Era of Big Data Analytics
Ajit Kumar Roy • Ex. National Consultant (IA) for East and North East Region, NAIP, ICAR • Ex. Consultant (Statistics), College of Fisheries, CAU, Agartala • Ex. Principal Scientist & Co-coordinator, Bioinformatics Centre of CIFA (ICAR), Bhubaneswar • Ex. Computer Specialist, SAARC Agricultural Information Centre, Dhaka, Bangladesh
Presented at 'International Workshop on Bioinformatics in Fisheries and Aquaculture’ held at ICAR- CIFRI, Barrackpore during 19-21 June 2017
Contents ➢ Basic concepts of Bioinformatics ➢ Advances in Bioinformatics / Computational Biology ➢ Growth of Genomic & Proteomic Data ➢ Introduction & basics of big data ➢ Cloud Computing ➢ Potential value & benefit of big data analytics ➢ Application of Cloud Computing in Bioinformatics ➢ Big data and privacy concerns
Bioinformatics BIOlogy
INFORmatics
matheMATICS
Interactions of Disciplines
….
Bioinformatics • The analysis of large volumes of genomic, proteomic and metabolomic data requiring sophisticated algorithms and powerful computers • Rapidly evolving field with an extreme shortage of skilled workers to write programs and analyze data
Goals of bioinformatics
• Classify
• Identify patterns / Pattern Recognition
• Make predictions
• Data Modelling, Creation of models & Prediction
• Assessment and Comparison
• Optimization
• Better utilize existing knowledge
Bioinformatic Goals • To understand integrative aspects of the biology of organisms, viewed as coherent complex structures • To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes • To study the evolution of biological systems • To support applications in agricultural, pharmaceutical and other scientific fields
Dr.S.Parthasarathy, NIT, Trichy
Systems Biology
[Figure: systems biology shown as the integration of genomics, transcriptomics, proteomics and metabolomics, together with molecular structures and gene networks]
Genomics, Proteomics & Systems Biology
[Figure: timeline from 1990 to 2020 showing Genomics, Proteomics and Systems Biology]
Examples of Bioinformatics • Database interfaces • Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …
• Sequence alignment • BLAST, FASTA
• Multiple sequence alignment • Clustal, MultAlin, DiAlign
• Gene finding • Genscan, GenomeScan, GeneMark, GRAIL
• Protein Domain analysis and identification • pfam, BLOCKS, ProDom,
• Pattern Identification/Characterization • Gibbs Sampler, AlignACE, MEME
• Protein Folding prediction • PredictProtein, SwissModeler
Five websites that all biologists should know • NCBI (The National Center for Biotechnology Information) • http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute) • http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource • http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource) • http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank) • http://www.rcsb.org/PDB/
A few more resources to be aware of • Human Genome Working Draft • http://genome.ucsc.edu/
• TIGR (The Institute for Genomic Research) • http://www.tigr.org/
• Celera • http://www.celera.com/
• (Model) Organism specific information:
  – Yeast: http://genome-www.stanford.edu/Saccharomyces/
  – Arabidopsis: http://www.tair.org/
  – Mouse: http://www.jax.org/
  – Fruitfly: http://www.fruitfly.org/
  – Nematode: http://www.wormbase.org/
• Nucleic Acids Research Database Issue • http://nar.oupjournals.org/ (First issue every year)
Bioinformatics concerns... • Prediction • Assessment and Comparison • Pattern Recognition • Data Modelling • Optimization • Rendering and Display
• Doing it all on a computer….
Applications of Bioinformatics
[Figure: application areas supported by data analysis, algorithms, visualization, statistics, etc.: search for new drugs, DNA chips, genetic variations, biochemical networks, optimizing therapies, genomes, molecular interactions, protein structure prediction. Chemical structure drawings omitted.]
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc aaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatct cttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatat tgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaa actttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgaca ttgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagca agcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaa tacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatga aatccaccattcaattacaacaagatcctctttcttgcacttgg
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ - NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ - NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN - LPADLAWFKRNTL -------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH - LPDDLHYFRAQTV -------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ - NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ - NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPW- NLPADLAWFKRNTLD -------KPVIMGRHTWESI d3dfr__ TAFLWAQDRNGLIGKDGHLPW - HLPDDLHYFRAQTVG -------KIMVVGRRTYESF
Sequence Analysis
A Matter of Scale*
• Physics: 28 particles
• Chemistry: 10,750 chemicals
• Biology: ~1 trillion trillion components
DNA sequences are meaningless! gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaaaatctctagcagtggcgcctgaacagggacttgaaagcgaaa gagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatataga ttaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaatactgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttag ataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttc agcattatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaaggggaagtgacatagcaggaactactagtacccttcag gaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggaccaaaggaaccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaatt ggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatgatgacagcatgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaag aaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaaattgcagggcccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagcccca ccagccccaccagaagagagcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgatacagtattagaagaaataaatttgcc 
aggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaa ttaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaaggaaggaaaaatttcaaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataaga aaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaa aggatcaccagcaatattccaaagcagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttaccacaccagacaaaaaacatcagaaagaacct ccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccagggattaaagtaaagcaattatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaa aagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaatacagaagcaggggcaaggtcaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaatt aacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactacccatacaaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaact ttctatgtagatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcacaaccag ataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagtactctttttagatggaatagataaagcccaagaagaacatgaaaaatatcacagtaattggagggcaat 
ggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagattgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaaca gggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtacatacagacaatggcagcaatttcaccagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacaga taagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccactttggaa aggaccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcaaggaaagctaaggg atggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctacatacaggagaaagagactggcatttgggtcaaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcat ctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagccctaggtgtgaatatcaagcaggacataacaaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccaga agaccaagggccacaaagggaaccatacaatgaatggacactagaacttttagaggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcatttcagaattgggtgtcaa catagcagaatagacattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaaggcttaggcatctcctatggcaggaagaagcggagacagcgacgaagagct cctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatagcaatagttgtgtggtccatagtattcatagaatataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagaca 
gtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttgggatgttaatgatctgtaaagctgcagaaaatttgtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggcca cacatgcctgtgtacccacagaccccaacccacaagaagtagaactgaagaatgtgacagaaaattttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgatactaacaccaataa tagtagtgctactaaccccactagtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgataatactagctataggttgagaagttgtaacacctcagtcattacacaggcctgtcca aaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttct cggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacattacaaggagaagtatacatgtaggacatgtaggaccaggcagagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaa attaagagaacaatttaagaataaaacaatagtctttaatcaatcctcaggaggggacccagaaattgtaatgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcagaataaaacaa attataaacatgtggcaggaagtaggaaaagcaatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatgaaggacaattggagaagtgaattatataaatataaagtagtaaaaattgaac cattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcgcagcgtcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaaca gcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaagggatcaacagctcctggggttttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaa 
attgataattacacaagcttaatatacaacttaattgaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattggaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttctatagtaa atagagttaggcagggatactcaccattgtcgtttcagacgcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgat tgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccacagccatagcagtagctgagggaactgatagggttatagaagtattacaaagagcttgtagagctattctccacatacctagaagaataaga cagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaaagaatgagaagagctgagccagcagcagatggggtgggagcagtatctcgagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagag gaggaggaggtgggttttccagtcagacctcaggtacctttaagaccaatgacttacaagggagcgttagatcttagccactttttaaaagaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggatca gatatccactgacctttggttggtgcttcaagctagtaccagttgagccagagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagtactccgtcacatggcccgagagctgcatccggagtactacaagga ctgctgacactgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtactgggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgcc ttgagtgcttca
From gene to protein and its function(s) Gene
> DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA
Function
> Protein sequence MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVS DVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEIS TKISGKKVALFGSYGWGDGKWMRDFEERMNGYG CVVVETPLIVQNEPDEAEQDCIEFGKKIANI
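The step from the DNA sequence to the protein sequence above can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and assuming translation starts at the first ATG in the given strand; the function name `translate_first_orf` is hypothetical, and real gene prediction needs far more than this.

```python
# Standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order for each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_first_orf(dna: str) -> str:
    """Translate from the first ATG up to the first stop codon.

    Assumption: the first ATG is the true start codon, which holds
    for the slide's example but not in general.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# First bases of the slide's DNA sequence, followed by a stop codon
# added here for illustration only.
print(translate_first_orf("AATTCATGAAAATCGTATACTGGTAG"))
```

The printed residues match the start of the protein sequence shown on the slide (MKIVYW...).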
Goals of Functional Genomics
What is the function of these structures?
What is the function of this sequence?
What is the function of this motif?
– the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
– knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
Bioinformatics is needed at all levels
[Diagram: DNA (5’ to 3’) → Transcription, Splicing → mRNA → Translation → Polypeptide → Folding → Protein → Function]
Function: • Transport / Localization • Oligomerization • Post-Translational Modification
At Genome Level • Genome projects need to store and organize DNA sequences
At Transcription Level • How do we find protein-coding regions, introns and exons in genomic DNA sequences?
At Transcription Level • Under which conditions is a certain gene transcribed?
At Translation Level • What do we know about a specific protein?
At Translation Level • How can we compare protein sequences?
At Structure Level • Can we predict protein structures?
Application and Use of databases
• Homology searching: use of knowledge from other, often better described organisms, such as the model organisms mouse, Drosophila, Fugu, C. elegans, etc.
  – Sequence level: position, annotation
  – Structural level: proteins, RNA
• Evolutionary analyses:
  – Phylogenetics
  – Population genetics
  – Molecular evolution of genetic elements
  – Genome evolution
• Primer design
• Microarray design
• Drug design
• Many more…
Web Access: http://www.ncbi.nlm.nih.gov
Text Searches Entrez System
Sequence Retrieval System (SRS)
BLAST: Sequence Similarity Searches
VAST: Structure Similarity Searches
BLAST: Basic Local Alignment Search Tool
NCBI BLAST http://www.ncbi.nlm.nih.gov/blast/
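BLAST is a heuristic: it first looks for short exact word matches ("seeds") shared by the query and a database sequence, then extends them into local alignments. The sketch below illustrates only the seeding idea under simplifying assumptions (exact matches, one subject sequence, no scoring or extension); `find_seed_hits` is a hypothetical name, not part of any BLAST package.

```python
from collections import defaultdict


def find_seed_hits(query: str, subject: str, k: int = 3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    # Index every word of length k in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up every word of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(find_seed_hits("GATTACA", "TTACG", k=3))
```

Indexing the words once is what lets a search avoid comparing the query against every position of every database sequence.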
June 26, 2000 at the White House
In the post-genomic era, a new language has been created for the new biology
• Genomics
• Functional Genomics
• Proteomics
• cDNA microarrays
• Global Gene Expression Patterns
New Computational Tools are Needed
• Sequencing
• Analyzing experimental data
• Representing vast quantities of information
• Searching
• Pattern matching
• Data mining
• Gene discovery
• Function discovery
High-throughput Technologies ➢ Next Generation Sequencing (NGS) ➢ Virtual Screening ➢ Genotyping ➢ SNP discovery ➢ Gene expression
High-throughput techniques
• High-throughput protein crystallization
• Massively parallel sequencing
• Mass spectrometry
• Microarray
• High-throughput cell imaging
• High-throughput in vivo screening
• …
How to extract the information? Computational tools • Building the databases • Perform analysis/extract features
• Data mining • Classification/statistical learning • Visualization/representation
Biological information!!!
High throughput Instruments PCR
NGS
PG Era-High throughput DNA sequencing Centre
Sequencing factories
Custom-designed factory-style conveyor belt robots: Perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.
Whitehead institute
Automated sequencing
DNA isolation
Mass spectrometer
Beckman Biomek FX
Sequencing
Affymetrix gene expression
Huge data -Biology • Consider biology: huge data are attainable since new automation is available: robotics, new chips, sequencing, imaging, etc. • Sensors measure everything from speed to smell.
Generation of Data ➢ Raw data from sequencing ➢ Expression data ➢ Data generated by linking other raw data in very large, multidimensional databases (e.g., OMIM) ➢ Research literature (full-text journals) ➢ Data models to describe the literature for retrieval, linking to other data, and linking to the raw data ➢ New data models to support greater flexibility in describing & manipulating data …
OVERLOADS OF OMICS INFORMATION The life sciences have been highly affected by the generation of large data sets, specifically by overloads of omics information: genomes, transcriptomes, epigenomes and other omics data from cells, tissues and organisms
NGS Platforms- data generations • Next-Generation Sequencing (NGS) platforms that use semiconductors or nanotechnology have exponentially increased the rate of biological data generation in the last two years.
Real-time/Fast Data
Mobile devices (tracking all objects all the time)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data • But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Next (second) Generation Sequencing • New technologies allowing the massive production of tens of millions of short sequencing fragments. Thus, it is also called “Massively parallel sequencing” • These techniques can be used to address problems similar to those tackled with microarrays
• They raised the promise of personalized medicine Setia Pramana
NGS • The advent of high-throughput sequencing technologies has initiated the ‘personal genome sequencing’ era for both normal and cancer genomes ➢ Large-scale international projects such as the 1000 Genomes Project and ➢ the International Cancer Genome Consortium
NGS • NGS technologies have been on the market only since 2004 • Have now largely replaced Sanger sequencing technologies (owing to the ultra-high-throughput production of hundreds of gigabases) • Ability to simultaneously sequence millions of DNA fragments: massively parallel sequencing technologies
NGS - The Big Picture
➢ 8.7 million species in the world (estimate)
➢ ~7 billion people
➢ Sequencers exist in both large centres & small research groups
➢ 200+ Illumina HiSeq sequencers in Europe alone
➢ Capacity to sequence 1600 human genomes / month
➢ Largest centre: Beijing Genomics Institute (BGI), ~140 HiSeq
➢ ~1500 HiSeq devices worldwide today
➢ 3-6 PB / day
➢ 1.1 – 2.2 exabytes / year
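The yearly figure follows from the daily one by simple arithmetic. A minimal check, assuming decimal units (1 EB = 1000 PB) and a 365-day year; the function name is illustrative only.

```python
PB_PER_EB = 1000  # assumption: decimal (SI) units


def yearly_exabytes(petabytes_per_day: float, days: int = 365) -> float:
    """Convert a daily data volume in PB into a yearly volume in EB."""
    return petabytes_per_day * days / PB_PER_EB


for daily in (3, 6):
    print(f"{daily} PB/day is about {yearly_exabytes(daily):.1f} EB/year")
```

Under these assumptions 3 PB/day and 6 PB/day correspond to roughly 1.1 and 2.2 EB/year, which agrees with the range on the slide.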
Next Generation Sequencing (NGS) Revolution
Huge Data Storage and HPC Demand
NGS Challenges • The highest cost is not the sequencing but the storage and analysis. • A standard human whole genome sequencing run (30-40x) creates about 100 Gb of data • Extreme data size causes problems – Just transferring and storing the data – Standard tools cannot be used • Think in fast and parallel programs
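The 100 Gb figure can be sanity-checked from the coverage. The sketch below reads "Gb" as gigabases (an assumption, since the slide does not say whether it means bases or bytes) and assumes a human genome of about 3.1 billion base pairs.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate human genome size


def raw_gigabases(genome_size_bp: int, coverage: float) -> float:
    """Raw sequence output in gigabases for a given depth of coverage."""
    return genome_size_bp * coverage / 1e9


for depth in (30, 40):
    print(f"{depth}x coverage: about {raw_gigabases(HUMAN_GENOME_BP, depth):.0f} gigabases")
```

At 30x to 40x this gives roughly 93 to 124 gigabases, in line with the order of magnitude quoted above. On-disk size depends on the file format, quality scores and compression, which are not modelled here.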
Growth of WGS in GenBank
The trend of data growth
The 21st century is a century of biotechnology:
• Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)
• Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.
• Proteomics: Global protein analysis generates large mass spectra libraries.
• Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.
• Glycomics: Global sugar metabolism analysis.
[Figure: GenBank growth, nucleotides (billions, 0 to 8) plotted against years, 1980 to 2000]
The Challenges of Molecular Biology Computing • The big dataset problem • DNA sequencing • Pairwise and Multiple Alignments • Similarity searching the databanks • Structure-function relationships; Can sequence patterns predict function? • Phylogenetic analysis: Sequence conservation across evolution • Genomics
Development of Algorithms for Sequence Comparison:
• Phylogenetic Algorithm
  – Complex mathematical formula used to determine sequence homology
  – All possible ways a large number of sequences can be compared to one another
• Fitch and Margoliash
  – Sequence comparisons to determine evolutionary trees
  – Computer calculates the minimum number of steps to convert one sequence to another and builds possible trees
• Needleman and Wunsch
  – Similarities in protein sequences
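The Needleman and Wunsch approach mentioned above is a dynamic-programming method for global alignment. A minimal sketch of the score computation follows, assuming a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) chosen here for illustration; real protein comparisons would use a substitution matrix such as BLOSUM.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b, computed row by row."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

Keeping only two rows gives the score in memory proportional to one sequence length; recovering the alignment itself would require storing the full matrix or using a traceback scheme.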
Major breakthroughs in Bioinformatics Through innovations in Mathematics or Statistics FASTA, BLAST, Phred/Phrap, BLOSUM, GenScan, PSI-BLAST, Threading, GRAIL etc.
Data at Petabyte scale
Petabyte Scale Data Generation
[Figure: data volume over time (2000, Today, 2010, 2015 & Beyond) on a scale of 10^12 (terabytes), 10^15 (petabytes), 10^18 (exabytes), 10^21 (zettabytes) and 10^24 (yottabytes). Labels: Theater Data Stream (2006): ~270 TB of NTM data / year; Example: One Theater’s Storage Capacity; Large Data JCTD; Capability Gap; UUVs; FIRESCOUT VTUAV DATA; GIG Data Capacity (Services, Transport & Storage). Values shown: 250 TB and 12 TB; years marked: 2006 and 2010.]
Source: Bob Gourley http://ctovision.typepad.com/InfoSharingTechnologyFutures.ppt
A New Era Of Analytics
The Myths About Big Data
• Big Data Is New
• Big Data Is Only About Massive Data Volume
• Big Data Means Hadoop
• Big Data Needs A Data Warehouse
• Big Data Means Unstructured Data
• Big Data Is for Social Media & Sentiment Analysis
BIG DATA • Big data refers to very large datasets with complex structures that are difficult to process using traditional methods and tools. • The term “process” includes capture, storage, formatting, extraction, curation, integration, analysis, and visualization • Big data and analytics are intertwined
Big Data Born
• Google, eBay, LinkedIn, and Facebook were built around Big Data from the beginning. • Big Data could stand alone. • Big Data analytics could be the only focus of analytics
Four Characteristics of Big Data
• Volume: cost-efficiently processing the growing volume (50x growth from 2010 to 2020, to 35 ZB)
• Velocity: responding to the increasing velocity (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety (80% of the world’s data is unstructured)
• Veracity: establishing the veracity of big data sources
Where Is This “Big Data” Coming From?
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones world wide
• 100s of millions of GPS enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
Deluge of Data ➢ Data is now considered the Fourth Paradigm in Science ➢ The first three paradigms were experimental, theoretical and computational science. ➢ This shift is being driven by the rapid growth in data from improvements in scientific instruments
Transfer of data from one location to another • Shipping external hard disks • Processing the data while it is being transferred • Future? Data won’t be moved!
Cloud computing • Increased need to store data • Cloud-based computing solutions have emerged
DERIVING VALUE VIA HARNESSING ➢ VOLUME, ➢ VARIETY ➢ VELOCITY
The big data problem:
In the end it is a Computing Challenge
Computing Challenge Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers
Example: A de novo assembly algorithm for DNA data finds reads whose sequences “overlap” and records those overlaps in a huge diagram called an assembly graph. ➢ For a large genome, this graph can occupy many terabytes of RAM, and ➢ completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
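The overlap step described above can be illustrated on a toy scale. This is a naive sketch, assuming exact suffix-prefix matches and an all-against-all comparison; the function names and the `min_len` threshold are illustrative, and production assemblers rely on indexed data structures and error-tolerant matching instead.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len: int = 3):
    """Map each ordered pair of reads (a, b) to the overlap length, where one exists."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph


reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
for (a, b), n in build_overlap_graph(reads).items():
    print(f"{a} -> {b} (overlap {n})")
```

The pairwise loop grows quadratically with the number of reads, which hints at why graphs for large genomes demand so much memory and compute time.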
Data deluge appears in genomics • The DNA data deluge comes from thousands of sources • More than 2000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year. And soon, tens of thousands of sources!
The total computing burden is growing • Computing, not sequencing, is now the slower and more costly aspect of genomics research
How can we help? • What is clear is that it will involve both better algorithms and a renewed focus on “big data” approaches to managing and processing data.
What is the time required to retrieve information? • Assume a read rate of 100 MB/s • Scanning 1 terabyte then takes about 2.8 hours (10^12 bytes ÷ 10^8 bytes/s = 10,000 s)
Solution? • Massive parallelism, not only in computation but also in storage • How do companies like Google read and process data from 10,000 disks in parallel?
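The effect of parallel storage on scan time is easy to work out. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), data spread evenly across disks, and no coordination overhead; all of these are idealizations.

```python
TB = 10**12  # bytes, decimal units assumed
MB = 10**6


def scan_seconds(total_bytes: int, bytes_per_second_per_disk: int, disks: int = 1) -> float:
    """Time to read total_bytes when the data is split evenly over the disks."""
    return total_bytes / (bytes_per_second_per_disk * disks)


single = scan_seconds(1 * TB, 100 * MB)
parallel = scan_seconds(1 * TB, 100 * MB, disks=10_000)
print(f"1 disk: {single:.0f} s (about {single / 3600:.1f} hours)")
print(f"10,000 disks: {parallel:.0f} s")
```

Under these assumptions a single disk needs 10,000 seconds, while 10,000 disks reading in parallel need about one second, which is the motivation for distributing storage as well as computation.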
New programming model • To meet the challenges: MapReduce – Programming Model introduced by Google in early 2000s to support distributed computing • Ecosystem of big data processing tools – open source, distributed, and run on commodity hardware.
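The MapReduce model can be illustrated without a cluster. The sketch below runs the three phases (map, shuffle, reduce) in a single Python process to count k-mers in sequencing reads; the function names are hypothetical and this is not the Hadoop API, only the shape of the computation.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read: str, k: int):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def mapreduce_kmer_count(reads, k: int = 3):
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


print(mapreduce_kmer_count(["ACGT", "CGTA"], k=3))
```

Because each read is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what a distributed framework automates.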
Computational Facility • Current computer systems available at genomics research institutions are commonly designed to run general computational jobs, where – Traditionally the limiting resource is the use of CPU. – Also, we find a large common storage space shared by all nodes.
Important: Remote Nodes are Closer Interconnects have become much faster
Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications • Massively parallel processing, scale-out architectures are well-suited for big data apps
Why Big Data • Growth of Big Data needs ➢ Increase of storage capacities ➢ Increase of processing power ➢ Availability of data(different data types) ➢ Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone
How Is Big Data Different? 1) Automatically generated by a machine (e.g. Sensor embedded in an engine) 2) Typically an entirely new source of data (e.g. Use of the internet) 3) Not designed to be friendly (e.g. Text streams) 4) May not have much value • Need to focus on the important part
Estimates of Big Data volume • In 2020, there will be 35 zettabytes of digital data. • That represents a stack of DVDs that would reach halfway from the Earth to Mars. • Facebook has 70 petabytes and 2700 multiprocessor nodes. • The Bing search engine has 150 petabytes and 40,000 nodes. • Big Data: Techniques and Technologies that Make Handling Data at Extreme Scale Economical.
Why do we care about Big Data? • Data is the new oil – we have to learn how to mine it! Qatar – European Commission Report • $ 7 trillion economic value in 7 US sectors alone • New McKinsey 4th factor of production :Land, Labor, Capital, + Data • Gartner: hundreds of billions of GDP by 2020 • Big Data: new driver for digital economy & society • $90 B annually in sensitive devices
Current Situation • Soon it will cost less to sequence a base of DNA than to store it on hard disk • We are facing a potential tsunami of genome data that will swamp our storage systems and crush our compute clusters
Research in the omics sciences is moving from a hypothesis-driven to a data-driven approach • Efficient analysis and interpretation of Big Data opens new avenues to explore molecular biology • New paradigms are needed to store and access data, for its annotation and integration, and finally for inferring knowledge and making it available to researchers. • A clear awareness of present high performance computing (HPC) solutions in bioinformatics is the need of the day • Big Data analysis paradigms for computational biology
The Human Genome Project • First draft genome of human in 2001, final 2004 • Estimated costs $3 billion, time 13 years • Used Sanger Sequencing • Today: Illumina: 1 week, $9,500; Exome: 6 weeks*, $1,000 • Towards the $1,000 genome
Human Genome Project ➢EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. ➢In biomedicine the Human Genome Project is determining the sequences of the three billion chemical base pairs that make up human DNA.
Impact • Decoding the human genome originally took about 10 years; now it can be achieved in a few hours • The Human Genome Project was a $3 billion project that required over a decade and was completed in 2003. • Now we are able to sequence and analyze an entire genome for less than a thousand dollars.
Big Data and Web 2.0 • O’Reilly Media introduced the term “Big Data” a year after Web 2.0 appeared, as many valuable Big Data situations are indeed related to consumer conduct. • Web 2.0 provided the impulse to rethink the interaction that was taking place on the Internet, and to push it somewhat further.
Potential Value & Benefits of Big data Analytics
Big data analytics can
✓ accumulate the wisdom of crowds
✓ reveal patterns
✓ yield best practices
Types of tools used in Big-Data • Where processing is hosted? – Distributed Servers / Cloud (e.g. Amazon EC2)
• Where data is stored? – Distributed Storage (e.g. Amazon S3)
• What is the programming model? – Distributed Processing (e.g. MapReduce)
• How data is stored & indexed? – High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data? – Analytic / Semantic Processing
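A minimal sketch of what "schema-free" storage means in practice: records in the same collection may carry different fields, and lookups go through a per-field index. The `DocumentStore` class, its methods and the sample records are hypothetical illustrations in plain Python; this is not MongoDB's API.

```python
from collections import defaultdict


class DocumentStore:
    """Toy in-memory, schema-free store (hypothetical; not a real database API)."""

    def __init__(self):
        self._docs = []
        # field name -> field value -> list of document ids
        self._index = defaultdict(lambda: defaultdict(list))

    def insert(self, doc):
        """Store a document (any dict) and index its scalar fields."""
        doc_id = len(self._docs)
        self._docs.append(dict(doc))
        for field, value in doc.items():
            if isinstance(value, (str, int, float)):
                self._index[field][value].append(doc_id)
        return doc_id

    def find(self, field, value):
        """Return every document whose `field` equals `value`."""
        return [self._docs[i] for i in self._index[field].get(value, [])]


# Illustrative records with made-up values; note the differing fields.
store = DocumentStore()
store.insert({"species": "Labeo rohita", "gene": "IGF1", "length_bp": 1200})
store.insert({"species": "Labeo rohita", "protein": "myostatin"})
store.insert({"species": "Catla catla", "gene": "GH"})

rohu_records = store.find("species", "Labeo rohita")
```

No table definition is needed before inserting the second record, which has a `protein` field and no `gene` field; a relational table would require a schema change or NULL columns.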
Major Big Data Technologies Several technologies and frameworks have been developed to handle big data, such as:
• Hadoop (or Apache Hadoop) – By far the most popular Big Data tool. It is an open-source platform whose major application is to process and manage large volumes of persistently changing data.
• MapReduce – The foundation framework for Hadoop. It allows massive volumes of data to be handled in a parallel, distributed processing environment.
• NoSQL – These are referred to as "not only SQL" databases, which are very different from traditional relational databases. Unlike relational databases, NoSQL does not require any specific table schema for data handling.
• Grid Computing – A special type of distributed computing in which a connection is established between multiple geographically dispersed computing resources. These resources operate in parallel to handle large chunks of data.
• In-memory databases – Databases that use the main memory of the system for data processing. They are used in systems where fast response times are required and the volume of data requests is high.
• Specialized databases – Large databases that manage and process data to provide specific information.
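The MapReduce model described above can be sketched on a single machine as three phases: map, shuffle (group by key) and reduce. The example counts k-mers in a few short reads; the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) are assumptions for illustration and are not Hadoop's API. In a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(kmer, vals) for kmer, vals in sorted(grouped.items()))


counts = run_job(["ACGTAC", "GTACGT"])
# counts == {"ACG": 2, "CGT": 2, "GTA": 2, "TAC": 2}
```

Because each read is mapped independently and each key is reduced independently, both phases can be spread over many workers without changing the result.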
The Techniques and Methods of Analytics Learning analytics draws upon techniques from a number of established fields: – Statistics – Artificial Intelligence – Machine Learning – Data mining – Social Network Analysis – Text Mining and Web Analytics – Operational Research – Information Visualization
Benefits • Cost & management – Economies of scale, “out-sourced” resource management
• Reduced Time to deployment – Ease of assembly, works “out of the box”
• Scaling – On demand provisioning, co-locate data and compute
• Reliability – Massive, redundant, shared resources
• Sustainability – Hardware not owned
Cloud Computing • “Computing in which dynamically scalable and virtualized resources are provided as a service over the Internet”.
Cloud Computing • IT resources provided as a service – Compute, storage, databases, queues
• Clouds leverage economies of scale of commodity hardware – Cheap storage, high bandwidth networks & multicore processors – Geographically distributed data centers
• Offerings from Microsoft, Amazon, Google, …
wikipedia:Cloud Computing
Types of Cloud Computing • Public Cloud: Computing infrastructure is hosted at the vendor’s premises. • Private Cloud: Computing architecture is dedicated to the customer and is not shared with other organisations. • Hybrid Cloud: Organisations host some critical, secure applications in private clouds. The not so critical applications are hosted in the public cloud – Cloud bursting: the organisation uses its own infrastructure for normal usage, but cloud is used for peak loads.
• Community Cloud
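Cloud bursting, as described above, can be sketched as a simple placement rule: jobs run on the organisation's own infrastructure while capacity remains, and overflow goes to the public cloud. The function `place_jobs` and the core counts are hypothetical; a real scheduler would also weigh cost, data locality and security.

```python
def place_jobs(job_cores, local_capacity):
    """Assign each job (given as a core count) to local hardware or the cloud.

    Jobs are taken in order; a job goes to the cloud only when the
    remaining local capacity cannot hold it.
    """
    local, cloud = [], []
    free = local_capacity
    for cores in job_cores:
        if cores <= free:
            local.append(cores)
            free -= cores
        else:
            cloud.append(cores)
    return local, cloud


# A 24-core local cluster facing a peak load of four jobs.
local_jobs, cloud_jobs = place_jobs([8, 16, 4, 32], local_capacity=24)
# local_jobs == [8, 16], cloud_jobs == [4, 32]
```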
Classification of Cloud Computing based on Service Provided • Infrastructure as a service (IaaS) – Offering hardware related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers. – Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.
• Platform as a Service (PaaS) – Offering a development platform on the cloud. – Google App Engine, Microsoft Azure, Salesforce.com’s Force.com.
• Software as a Service (SaaS) – Including a complete software offering on the cloud. Users can access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector. – Salesforce.com’s offering in the online Customer Relationship Management (CRM) space, Google’s Gmail and Google Docs, Microsoft’s Hotmail.
Infrastructure as a Service (IaaS)
Cloud Platforms • Amazon Elastic Compute Cloud • Google App Engine • Microsoft Azure • GoGrid • AppNexus
Bioinformatics Cloud
CloudBurst • Running time on EC2 High-Medium Instance Cluster • Comparison of CloudBurst running time while scaling the size of the cluster for mapping 7M reads to human chromosome 22 with at most four mismatches on the EC2 cluster. • The 96-core cluster is 3.5X faster than the 24-core cluster. • [CloudBurst: highly sensitive read mapping with Map)
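The scaling figure above can be read as a parallel efficiency: going from 24 to 96 cores is a 4X increase in hardware, so a 3.5X speedup is somewhat below ideal linear scaling. A short worked calculation, using only the numbers on the slide (the helper name `parallel_efficiency` is an assumption):

```python
def parallel_efficiency(speedup, base_cores, scaled_cores):
    """Ratio of the observed speedup to the ideal (linear) speedup."""
    ideal_speedup = scaled_cores / base_cores
    return speedup / ideal_speedup


efficiency = parallel_efficiency(speedup=3.5, base_cores=24, scaled_cores=96)
# ideal speedup is 96 / 24 = 4.0, so efficiency = 3.5 / 4.0 = 0.875
```

An efficiency of 0.875 means about 87.5% of the added cores translate into faster mapping; the rest is lost to overheads such as job start-up and data movement.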
EBI-Cloud • The EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. • Clouds are a solution, but they also throw up fresh challenges • Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology
Bioinformatics applications • Alu sequence classification is one of the most challenging problems for sequence clustering because Alus represent the largest repeat families in the human genome. • An EST (Expressed Sequence Tag) corresponds to messenger RNA transcribed from genes residing on chromosomes. • Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct the full-length mRNA sequence for each expressed gene.
The distinction between big data technologies and cloud computing
• Cloud computing is often used to facilitate the cost-effective storage of such large datasets.
• Big data technologies are often offered as Platform as a Service (PaaS) within a cloud environment.
• These technologies, while often coinciding, are distinct and can operate independently of each other.
MapReduce • One of the first MapReduce projects applied in the biotechnology space resulted in the Genome Analysis Tool Kit (GATK). • CloudBurst was one of the first of these, developed by Michael Schatz et al. • Crossbow performs SNP identification, using Hadoop’s massive sort engine to order the alignments along the genome and then genotyping the sample using SoapSNP. • It aligns an input mix of 3 billion paired-end and unpaired reads, equivalent to 110 GB of compressed sequence data, and as output it catalogues all the SNPs in the genome. • Crossbow can genotype a human in approximately 3 hours on a 320-core cluster, discovering 3.7 million SNPs at >99% accuracy for $100 using AWS EC2.
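The sort-then-genotype idea behind the Crossbow bullet can be sketched on a toy scale: base observations from aligned reads are ordered along the genome, grouped by position, and each position is called. The majority vote used here is a stand-in assumption and is not SoapSNP's statistical model; `call_snps`, `min_support` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import groupby


def call_snps(observations, reference, min_support=2):
    """Call SNPs from (chrom, pos, base) observations.

    Observations are sorted along the genome, grouped per position, and a
    SNP is reported when the most common base has at least `min_support`
    reads behind it and differs from the reference base.
    """
    def position(obs):
        return obs[0], obs[1]

    snps = []
    for (chrom, pos), group in groupby(sorted(observations, key=position), key=position):
        base_counts = Counter(base for _, _, base in group)
        base, support = base_counts.most_common(1)[0]
        ref_base = reference[(chrom, pos)]
        if support >= min_support and base != ref_base:
            snps.append((chrom, pos, ref_base, base))
    return snps


# Made-up observations, deliberately out of genomic order.
observations = [
    ("chr1", 5, "T"), ("chr1", 2, "A"), ("chr1", 5, "T"),
    ("chr1", 2, "A"), ("chr1", 5, "C"), ("chr1", 9, "G"),
]
reference = {("chr1", 2): "A", ("chr1", 5): "C", ("chr1", 9): "T"}

snps = call_snps(observations, reference)
# snps == [("chr1", 5, "C", "T")]
```

Position 9 is not reported because a single read is below the support threshold. In a Hadoop setting the sort would be done by the framework's shuffle, and each genomic region would be genotyped by a separate reducer.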
Cloud BioLinux • Cloud BioLinux, created by the J. Craig Venter Institute (JCVI), is an example of SaaS. • It is a publicly accessible virtual machine that is stored at Amazon EC2 and is freely available. • It offers a user-friendly Graphical User Interface (GUI), along with over 100 pre-installed bioinformatics tools including Galaxy, BioPerl, BLAST, Bioconductor, Glimmer, GeneSpring, ClustalW and EMBOSS utilities, amongst others. • While Linux-based bioinformatics distributions such as DNALinux, BioSlax, BioKnoppix and DebianMed are not unusual, they are built to run on standalone local machines. • SaaS initiatives such as BioLinux have been known to be referred to as Science as a Service (ScaaS).
SaaS VM • Another significant advantage of using such SaaS VM images on a public cloud, such as Amazon, is that Amazon provides access to several large genomic data sets including • the 1000 Genomes Project, • NCBI, GenBank and Ensembl. • CloVR provides a similar image with pre-installed packages. • Standalone bio/medical software applications/suites with a cloud backend include Taverna, FX, SeqWare, BioVLab, and commercial equivalents such as DNAnexus.
Parallelised big data technologies and genomics • Genomics research is the dream use case for big data technologies which, if unified, are likely to have a profoundly positive impact on mankind
The Data Deluge – Wired 16.07
Paradigm shift: from the old compute-centric model to the new data-centric model
PROTEOMICS in Aquaculture: Applications and trends • Omic technologies (i.e., genomics, metabolomics and proteomics) have been widely implemented in the field of farm animal proteomics with a very positive impact in areas such as aquaculture. • Proteomics, as a powerful comparative tool, has been increasingly used over the last decade to address different questions in aquaculture, regarding welfare, nutrition, health, quality, and safety.
• A surprising wealth of tools has already been generated for genome mapping and functional studies in many of the species used in aquaculture. • With the potential for sequencing on the horizon, the future is bright for aquaculture genomics.
Recommendations • Success in aquaculture research, which must deal with increasing amounts of omics data combined with clinical information, will depend on our ability to interpret the large-scale data sets generated by emerging technologies
Recommendations • Private companies such as Microsoft, Oracle, Amazon, Google, Facebook and Twitter are masters in dealing with petabyte scale data sets. • Aquaculture will need to implement the same type of scalable structure to deal with volumes of data generated by omics technologies. • The life sciences will need to adapt to the advances in informatics to successfully address the Big Data problems that will be faced in the next decade.
Big data and privacy concerns
Privacy Concern: According to some experts, the ability to protect medical and genomics data in the era of big data and a changing privacy landscape is a growing challenge.
The growing globalization of data flows via big data increases the risk that people can lose control of their own data
Celia Fernández Aller ETSISI UPM
THANK U for Patient hearing