Cloud-based Services

Trends in Computational Biology and Bioinformatics in the Era of Big Data Analytics

Ajit Kumar Roy • Ex. National Consultant (IA) for East and North East Region, NAIP, ICAR • Ex. Consultant (Statistics), College of Fisheries, CAU, Agartala • Ex. Principal Scientist & Co-coordinator, Bioinformatics Centre of CIFA (ICAR), Bhubaneswar • Ex. Computer Specialist, SAARC Agricultural Information Centre, Dhaka, Bangladesh

Presented at the 'International Workshop on Bioinformatics in Fisheries and Aquaculture' held at ICAR-CIFRI, Barrackpore during 19-21 June 2017

Contents ➢ Basic concepts of Bioinformatics ➢ Advances in Bioinformatics / Computational Biology ➢ Growth of Genomic & Proteomic Data ➢ Introduction & basics of big data ➢ Cloud Computing ➢ Potential value & benefit of big data analytics ➢ Application of Cloud Computing in Bioinformatics ➢ Big data and privacy concerns

Bioinformatics BIOlogy

INFORmatics

matheMATICS

Interactions of Disciplines


Bioinformatics • The analysis of large volumes of genomic, proteomic and metabolomic data requiring sophisticated algorithms and powerful computers • Rapidly evolving field with an extreme shortage of skilled workers to write programs and analyze data

Goals of bioinformatics • Classify • Identify patterns / Pattern Recognition • Make predictions • Data Modelling, Creation of models & Prediction • Assessment and Comparison • Optimization • Better utilize existing knowledge

Bioinformatic Goals • To understand integrative aspects of the biology of organisms, viewed as coherent complex structures • To interrelate sequences, 3-D structures, interactions and functions of proteins, nucleic acids and their complexes • To study the evolution of biological systems • To support applications in agricultural, pharmaceutical and other scientific fields

Dr.S.Parthasarathy, NIT, Trichy


21/6/2017

Systems Biology

[Figure: diagram relating systems biology to genomics, proteomics, transcriptomics, metabolomics, molecular structure and gene networks]

Genomics, Proteomics & Systems Biology

[Figure: timeline of Genomics, Proteomics and Systems Biology, 1990 to 2020]

Examples of Bioinformatics • Database interfaces • Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …

• Sequence alignment • BLAST, FASTA

• Multiple sequence alignment • Clustal, MultAlin, DiAlign

• Gene finding • Genscan, GenomeScan, GeneMark, GRAIL

• Protein Domain analysis and identification • Pfam, BLOCKS, ProDom

• Pattern Identification/Characterization • Gibbs Sampler, AlignACE, MEME

• Protein Folding prediction • PredictProtein, SwissModeler

Five websites that all biologists should know • NCBI (The National Center for Biotechnology Information) • http://www.ncbi.nlm.nih.gov/

• EBI (The European Bioinformatics Institute) • http://www.ebi.ac.uk/

• The Canadian Bioinformatics Resource • http://www.cbr.nrc.ca/

• SwissProt/ExPASy (Swiss Bioinformatics Resource) • http://expasy.cbr.nrc.ca/sprot/

• PDB (The Protein Databank) • http://www.rcsb.org/PDB/

A few more resources to be aware of • Human Genome Working Draft • http://genome.ucsc.edu/

• TIGR (The Institute for Genomic Research) • http://www.tigr.org/

• Celera • http://www.celera.com/

• (Model) Organism specific information: • Yeast: http://genome-www.stanford.edu/Saccharomyces/ • Arabidopsis: http://www.tair.org/ • Mouse: http://www.jax.org/ • Fruitfly: http://www.fruitfly.org/ • Nematode: http://www.wormbase.org/

• Nucleic Acids Research Database Issue • http://nar.oupjournals.org/ (First issue every year)

Bioinformatics concerns... • Prediction • Assessment and Comparison • Pattern Recognition • Data Modelling • Optimization • Rendering and Display

• Doing it all on a computer….


Applications of Bioinformatics

[Figure: application areas arranged around a core of data analysis, algorithms, visualization, statistics, etc.: search for new drugs, DNA chips, genetic variations, biochemical networks, optimizing therapies, genomes, molecular interactions, protein structure prediction]

caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc aaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatct cttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatat tgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaa actttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgaca ttgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagca agcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaa tacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatga aatccaccattcaattacaacaagatcctctttcttgcacttgg

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ - NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ - NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPWN - LPADLAWFKRNTL -------NKPVIMGRHTWESI d3dfr__ TAFLWAQDRDGLIGKDGHLPWH - LPDDLHYFRAQTV -------GKIMVVGRRTYESF d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ - NLVIMGKKTWFSI d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ - NAVIMGKKTWFSI d4dfra_ ISLIAALAVDRVIGMENAMPW- NLPADLAWFKRNTLD -------KPVIMGRHTWESI d3dfr__ TAFLWAQDRNGLIGKDGHLPW - HLPDDLHYFRAQTVG -------KIMVVGRRTYESF

Sequence Analysis

A Matter of Scale* • Physics: 28 particles • Chemistry: 10,750 chemicals • Biology: ~1 trillion trillion components

DNA sequences are meaningless! gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaaaatctctagcagtggcgcctgaacagggacttgaaagcgaaa gagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatataga ttaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaatactgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttag ataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttc agcattatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaaggggaagtgacatagcaggaactactagtacccttcag gaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggaccaaaggaaccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaatt ggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatgatgacagcatgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaag aaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaaattgcagggcccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagcccca ccagccccaccagaagagagcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgatacagtattagaagaaataaatttgcc 
aggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaa ttaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaaggaaggaaaaatttcaaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataaga aaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaa aggatcaccagcaatattccaaagcagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttaccacaccagacaaaaaacatcagaaagaacct ccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccagggattaaagtaaagcaattatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaa aagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaatacagaagcaggggcaaggtcaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaatt aacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactacccatacaaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaact ttctatgtagatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcacaaccag ataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagtactctttttagatggaatagataaagcccaagaagaacatgaaaaatatcacagtaattggagggcaat 
ggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagattgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaaca gggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtacatacagacaatggcagcaatttcaccagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacaga taagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccactttggaa aggaccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcaaggaaagctaaggg atggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctacatacaggagaaagagactggcatttgggtcaaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcat ctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagccctaggtgtgaatatcaagcaggacataacaaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccaga agaccaagggccacaaagggaaccatacaatgaatggacactagaacttttagaggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcatttcagaattgggtgtcaa catagcagaatagacattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaaggcttaggcatctcctatggcaggaagaagcggagacagcgacgaagagct cctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatagcaatagttgtgtggtccatagtattcatagaatataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagaca 
gtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttgggatgttaatgatctgtaaagctgcagaaaatttgtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggcca cacatgcctgtgtacccacagaccccaacccacaagaagtagaactgaagaatgtgacagaaaattttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgatactaacaccaataa tagtagtgctactaaccccactagtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgataatactagctataggttgagaagttgtaacacctcagtcattacacaggcctgtcca aaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttct cggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacattacaaggagaagtatacatgtaggacatgtaggaccaggcagagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaa attaagagaacaatttaagaataaaacaatagtctttaatcaatcctcaggaggggacccagaaattgtaatgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcagaataaaacaa attataaacatgtggcaggaagtaggaaaagcaatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatgaaggacaattggagaagtgaattatataaatataaagtagtaaaaattgaac cattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcgcagcgtcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaaca gcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaagggatcaacagctcctggggttttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaa 
attgataattacacaagcttaatatacaacttaattgaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattggaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttctatagtaa atagagttaggcagggatactcaccattgtcgtttcagacgcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgat tgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccacagccatagcagtagctgagggaactgatagggttatagaagtattacaaagagcttgtagagctattctccacatacctagaagaataaga cagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaaagaatgagaagagctgagccagcagcagatggggtgggagcagtatctcgagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagag gaggaggaggtgggttttccagtcagacctcaggtacctttaagaccaatgacttacaagggagcgttagatcttagccactttttaaaagaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggatca gatatccactgacctttggttggtgcttcaagctagtaccagttgagccagagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagtactccgtcacatggcccgagagctgcatccggagtactacaagga ctgctgacactgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtactgggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgcc ttgagtgcttca

From gene to protein and its function(s) Gene

> DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA

Function

> Protein sequence MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVS DVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEIS TKISGKKVALFGSYGWGDGKWMRDFEERMNGYG CVVVETPLIVQNEPDEAEQDCIEFGKKIANI
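The step from the DNA sequence to the protein sequence shown on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and assuming that translation starts at the first ATG and stops at the first stop codon; the names `CODON_TABLE` and `translate` are hypothetical helpers, not part of any library. Real gene prediction has to deal with reading frames, both strands, introns and alternative genetic codes, none of which is modelled here.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    dna = "".join(dna.split()).upper()   # drop spaces and line breaks
    start = dna.find("ATG")
    if start < 0:
        return ""                        # no start codon found
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X = unknown codon
        if amino_acid == "*":            # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 40 bases of the DNA sequence on the slide
print(translate("AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC"))
```

Applied to the first 40 bases of the slide's DNA sequence, the sketch reproduces the opening residues of the protein sequence above (MKIVYWSGTGN).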

Goals of Functional Genomics

What is the function of these structures?

What is the function of this sequence?

What is the function of this motif?

– the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions

– knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

Bioinformatics is needed in all levels

DNA

5’

3’

Transcription Splicing

mRNA

Translation Poly-peptide

Folding

Protein

Function

Function

• Transport / Localization • Oligomerization • Post-Translational Modification

At Genome Level • Genome projects need to store and organize DNA sequences

At Transcription Level • How do we find protein-coding regions, introns and exons in genomic DNA sequences? • Under which conditions is a certain gene transcribed?

At Translation Level • What do we know about a specific protein? • How can we compare protein sequences?

At Structure Level • Can we predict protein structures?

Application and Use of databases • Homology searching: use of knowledge from other, often better-described organisms such as the model organisms mouse, Drosophila, Fugu, C. elegans etc. – Sequence level – position, annotation – Structural level – proteins, RNA • Evolutionary analyses: – Phylogenetics – Population genetics – Molecular evolution of genetic elements – Genome evolution • Primer design • Microarray design • Drug design • Many more…

Web Access: http://www.ncbi.nlm.nih.gov

Text Searches Entrez System

Sequence Retrieval System (SRS)

BLAST: Sequence Similarity Searches

VAST: Structure Similarity Searches

BLAST: Basic Local Alignment Search Tool

NCBI BLAST http://www.ncbi.nlm.nih.gov/blast/

June 26, 2000 at the White House

In the post-genomic era, a new language has been created for the new biology • Genomics • Functional Genomics • Proteomics • cDNA microarrays • Global Gene Expression Patterns

New Computational Tools are Needed • Sequencing • Analyzing experimental data • Representing vast quantities of information • Searching • Pattern matching • Data mining • Gene discovery • Function discovery

High-throughput Technologies ➢Next Generation Sequencing (NGS) ➢Virtual Screening ➢Genotyping ➢SNP discovery ➢Gene express

High-throughput techniques • High-throughput protein crystallization • Massively parallel sequencing • Mass spectrometry • Microarray • High-throughput cell imaging • High-throughput in vivo screening • …

How to extract the information? Computational tools • Building the databases • Perform analysis/extract features

• Data mining • Classification/statistical learning • Visualization/representation

Biological information!!!

High throughput Instruments PCR

NGS

PG Era-High throughput DNA sequencing Centre

Sequencing factories

Custom-designed factory-style conveyor belt robots: Perform all functions from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.

Whitehead institute

Automated sequencing

DNA isolation

Mass spectrometer

Beckman Biomek FX

Sequencing

Affymetrix gene expression

Huge data -Biology • Consider biology: huge data are attainable since new automation is available: robotics, new chips, sequencing, imaging, etc. • Sensors measure everything from speed to smell.

Generation of Data ➢ Raw data from sequencing ➢ Expression data ➢ Data generated by linking other raw data in very large, multidimensional databases (e.g., OMIM) ➢ Research literature (full-text journals) ➢ Data models to describe the literature for retrieval, linking to other data, and linking to the raw data ➢ New data models to support greater flexibility in describing & manipulating data …

OVERLOADS OF OMICS INFORMATION Life sciences have been highly affected by the generation of large data sets, specifically by overloads of omics information: genomes, transcriptomes, epigenomes & other omics data from cells, tissues and organisms

NGS Platforms- data generations • Next-Generation Sequencing (NGS) platforms that use semiconductors or nanotechnology have exponentially increased the rate of biological data generation in the last two years.

Real-time/Fast Data

Mobile devices (tracking all objects all the time)

Scientific instruments (collecting all sorts of data)

Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data • But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

Next (second) Generation Sequencing • New technologies allow the massive production of tens of millions of short sequencing fragments; the approach is therefore also called “massively parallel sequencing” • These techniques can be used to – deal with problems similar to those addressed by microarrays

• They raised the promise of personalized medicine Setia Pramana

NGS • The advent of high-throughput sequencing technologies has initiated the ‘personal genome sequencing’ era for both normal and cancer genomes ➢ Large-scale international projects such as the 1000 Genomes Project and ➢ the International Cancer Genome Consortium Setia Pramana

NGS • NGS technologies have been on the market only since 2004 • They have now largely replaced Sanger sequencing technologies (owing to their ultra-high-throughput production of hundreds of gigabases) • Ability to simultaneously sequence millions of DNA fragments - massively parallel sequencing technologies Setia Pramana

NGS - The Big Picture ➢ 8.7 million species in the world (estimate) ➢ ~7 billion people ➢ Sequencers exist in both large centres & small research groups ➢ 200+ Illumina HiSeq sequencers in Europe alone ➢ Capacity to sequence 1600 human genomes / month ➢ Largest centre: Beijing Genomics Institute (BGI) ➢ ~140 HiSeq ➢ ~1500 HiSeq devices worldwide today ➢ 3-6 PB / day ➢ 1.1 – 2.2 exabytes / year
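The last two figures on this slide are related by simple arithmetic, which the following sketch makes explicit. It assumes decimal units (1 exabyte = 1000 petabytes) and a 365-day year; the function name `exabytes_per_year` is a hypothetical helper introduced only for illustration.

```python
PB_PER_EB = 1000  # assumption: decimal units, 1 EB = 1000 PB


def exabytes_per_year(petabytes_per_day, days=365):
    """Convert a daily data rate in petabytes to a yearly volume in exabytes."""
    return petabytes_per_day * days / PB_PER_EB


for rate in (3, 6):
    print(rate, "PB/day ->", round(exabytes_per_year(rate), 1), "EB/year")
```

Under these assumptions, 3 to 6 PB per day corresponds to roughly 1.1 to 2.2 exabytes per year, which matches the range quoted on the slide.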

Next Generation Sequencing (NGS) Revolution

Huge Data Storage and HPC Demand

NGS Challenges • The highest cost is not the sequencing but the storage and analysis. • A standard human whole genome sequencing run (30-40x) creates about 100 Gb of data • Extreme data size causes problems – Just transferring and storing the data – Standard tools cannot be used • Think in terms of fast and parallel programs Setia Pramana
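The "100 Gb" figure can be checked with a back-of-the-envelope calculation: the number of sequenced bases is approximately the genome length times the coverage depth. The sketch below assumes a human genome of about three billion base pairs (the figure used elsewhere in these slides); `sequenced_gigabases` is a hypothetical helper name. It counts bases only, so actual file sizes, which depend on format, quality scores and compression, are not modelled.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 billion base pairs


def sequenced_gigabases(coverage, genome_bp=HUMAN_GENOME_BP):
    """Approximate number of sequenced bases (in gigabases) at a given coverage."""
    return coverage * genome_bp / 1e9


for depth in (30, 40):
    print(f"{depth}x coverage -> about {sequenced_gigabases(depth):.0f} gigabases")
```

At 30x to 40x coverage this gives roughly 90 to 120 gigabases, consistent with the order of magnitude stated on the slide.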

Growth of WGS in GenBank

The trend of data growth: the 21st century is a century of biotechnology

• Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year)

• Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.

• Proteomics: Global protein analysis generates large mass spectra libraries.

• Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized

• Glycomics: Global sugar metabolism analysis

[Figure: growth of GenBank, nucleotides (billions, 0 to 8) versus year, 1980 to 2000]

The Challenges of Molecular Biology Computing • The big dataset problem • DNA sequencing • Pairwise and Multiple Alignments • Similarity searching the databanks • Structure-function relationships; Can sequence patterns predict function? • Phylogenetic analysis: Sequence conservation across evolution • Genomics

Development of Algorithms for Sequence Comparison: • Phylogenetic Algorithm – Complex mathematical formula used to determine sequence homology – All possible ways a large number of sequences can be compared to one another • Fitch and Margoliash – Sequence comparisons to determine evolutionary trees – Computer calculates the minimum number of steps to convert one sequence to another and builds possible trees • Needleman and Wunsch – Similarities in protein sequences
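To make the Needleman and Wunsch idea concrete, here is a minimal sketch of global alignment scoring by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, and `needleman_wunsch_score` is a hypothetical function name; practical protein comparison would normally use a substitution matrix such as BLOSUM and more refined gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j];
    # the first row aligns an empty prefix of a against gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))  # three matches and one gap
```

Only two rows of the score matrix are kept in memory, which is enough when just the score is needed; recovering the alignment itself would require the full matrix or a traceback structure.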

Major breakthroughs in Bioinformatics Through innovations in Mathematics or Statistics FASTA, BLAST, Phred/Phrap, BLOSUM, GenScan, PSI-BLAST, Threading, GRAIL etc.

Data at Petabyte scale

Petabyte Scale Data Generation

[Figure: data volume over time (2000, today, 2010, 2015 & beyond) on a scale from 10^12 bytes (terabytes) through 10^15 (petabytes), 10^18 (exabytes) and 10^21 (zettabytes) to 10^24 (yottabytes), showing a capability gap between generated data (UUVs, Firescout VTUAV data, Large Data JCTD) and GIG data capacity (services, transport & storage). Annotations: theater data stream (2006): ~270 TB of NTM data / year; example of one theater's storage capacity: 250 TB; 12 TB (2006).]

Bob Gourley http://ctovision.typepad.com/InfoSharingTechnologyFutures.ppt

A New Era of Analytics

The Myth About Big Data

• Big Data Is New • Big Data Is Only About Massive Data Volume • Big Data Means Hadoop • Big Data Needs A Data Warehouse • Big Data Means Unstructured Data • Big Data Is for Social Media & Sentiment Analysis

BIG DATA • Big data refers to very large datasets with complex structures that are difficult to process using traditional methods and tools. • The term “process” includes capture, storage, formatting, extraction, curation, integration, analysis, and visualization • Big data and analytics are intertwined

Big Data Born

• Google, eBay, LinkedIn, and Facebook were built around Big Data from the beginning. • Big Data could stand alone. • Big Data analytics could be the only focus of analytics

Four Characteristics of Big Data • Volume: cost-efficiently processing the growing volume (50x growth from 2010 to 2020, to 35 ZB) • Velocity: responding to the increasing velocity (30 billion RFID sensors and counting) • Variety: collectively analyzing the broadening variety (80% of the world's data is unstructured) • Veracity: establishing the veracity of big data sources

Where Is This “Big Data” Coming From? • 4.6 billion camera phones worldwide • 100s of millions of GPS-enabled devices sold annually • ? TBs of data every day • 12+ TBs of tweet data every day • 30 billion RFID tags today (1.3B in 2005) • 2+ billion people on the Web by end 2011 • 25+ TBs of log data every day • 76 million smart meters in 2009… 200M by 2014

Deluge of Data ➢ Data is now considered the Fourth Paradigm in science – the first three paradigms were experimental, theoretical and computational science ➢ This shift is being driven by the rapid growth in data from improvements in scientific instruments

Transfer of data from one location to another • Shipping external hard disks • Processing the data while it is being transferred • Future? Data won’t be moved!

Cloud computing • Increased need to store data • Cloud-based computing solutions have emerged

DERIVING VALUE VIA HARNESSING ➢ VOLUME, ➢ VARIETY ➢ VELOCITY

The big data problem:

In the end it is a Computing Challenge

Computing Challenge Researchers need to crunch a large amount of data very quickly (and easily) using high-performance computers

Example: A de novo assembly algorithm for DNA data finds reads whose sequences “overlap” and records those overlaps in a huge diagram called an assembly graph. ➢ For a large genome, this graph can occupy many terabytes of RAM, and ➢ completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
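The overlap-finding step described in this example can be illustrated with a toy sketch. It is a naive all-against-all comparison, assuming exact suffix-prefix matches and a small minimum overlap length; the names `overlap` and `overlap_graph` are hypothetical, and the reads are made up for illustration. Real assemblers use indexing structures and tolerate sequencing errors, which is part of why the graph for a large genome becomes so demanding.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """Map each ordered pair of overlapping reads to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph


reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # hypothetical toy reads
for (a, b), k in overlap_graph(reads).items():
    print(f"{a} -> {b} (overlap {k})")
```

The number of read pairs grows quadratically with the number of reads, so even this tiny example hints at why memory and compute requirements explode for genome-scale data.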

Data deluge in genomics • The DNA data deluge comes from thousands of sources • More than 2000 sequencing instruments around the world produce more than 15 petabytes of genetic data per year • And soon, tens of thousands of sources!

The total computing burden is growing • Computing, not sequencing, is now the slower and more costly aspect of genomics research

How can we help? • What is clear is that it will involve both better algorithms and a renewed focus on “big data” approaches to managing and processing data.

What is the time required to retrieve information? • Assume 100 MB/sec • Scanning 1 terabyte: close to 3 hours (10^12 bytes at 10^8 bytes per second is 10,000 seconds)

Solution? • Massive parallelism, not only in computation but also in storage • How do companies like Google read and process data from 10,000 disks in parallel?
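The effect of parallel storage on scan time can be estimated with a short calculation. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a sustained 100 MB/s per disk, and data spread evenly across disks with no coordination overhead; `scan_seconds` is a hypothetical helper. At exactly 100 MB/s a single-disk scan takes about 10,000 seconds (roughly 2.8 hours); lower effective throughput in practice makes it longer.

```python
TB = 10**12  # bytes, decimal units (assumption)
MB = 10**6


def scan_seconds(total_bytes, bytes_per_sec_per_disk, disks=1):
    """Idealised time to read total_bytes striped evenly over `disks` disks."""
    return total_bytes / (bytes_per_sec_per_disk * disks)


single = scan_seconds(1 * TB, 100 * MB)
parallel = scan_seconds(1 * TB, 100 * MB, disks=10_000)
print(f"1 disk: {single:.0f} s ({single / 3600:.1f} h)")
print(f"10,000 disks: {parallel:.0f} s")
```

The idealised speed-up is linear in the number of disks; real systems lose some of it to scheduling, network transfer and stragglers.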

New programming model • To meet the challenges: MapReduce – Programming Model introduced by Google in early 2000s to support distributed computing • Ecosystem of big data processing tools – open source, distributed, and run on commodity hardware.
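The MapReduce programming model can be illustrated without any cluster software. The sketch below runs the three conceptual phases (map, shuffle, reduce) in a single Python process to count k-mers in a set of reads; the function names and the toy reads are hypothetical, and this is not the Hadoop API. In a real deployment the map and reduce calls would be distributed across many machines by the framework.

```python
from collections import defaultdict


def map_phase(read, k=3):
    """Map: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def kmer_counts(reads, k=3):
    pairs = [pair for read in reads for pair in map_phase(read, k)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(kmer_counts(["ACGT", "CGTA"]))
```

Because each map call looks at one read only and each reduce call at one key only, both phases can be parallelised freely, which is the property the model relies on.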

Computational Facility • Current computer systems available at genomics research institutions are commonly designed to run general computational jobs, where – traditionally the limiting resource is CPU – a large common storage space is shared by all nodes

Important: Remote Nodes are Closer Interconnects have become much faster

Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data applications • Massively parallel processing, scale out architectures are well-suited for big data apps

Why Big Data • Growth of Big Data needs ➢ Increase of storage capacities ➢ Increase of processing power ➢ Availability of data (different data types) ➢ Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone

How Is Big Data Different? 1) Automatically generated by a machine (e.g. sensor embedded in an engine) 2) Typically an entirely new source of data (e.g. use of the internet) 3) Not designed to be friendly (e.g. text streams) 4) May not have much value • Need to focus on the important part

Estimates of Big Data volume • In 2020, there will be 35 zettabytes of digital data. • That represents a stack of DVDs that would reach halfway from the Earth to Mars. • Facebook has 70 petabytes and 2700 multiprocessor nodes. • The Bing search engine has 150 petabytes and 40,000 nodes. • Big Data: Techniques and Technologies that Make Handling Data at Extreme Scale Economical.

Why do we care about Big Data? • Data is the new oil – we have to learn how to mine it! Qatar – European Commission Report • $ 7 trillion economic value in 7 US sectors alone • New McKinsey 4th factor of production :Land, Labor, Capital, + Data • Gartner: hundreds of billions of GDP by 2020 • Big Data: new driver for digital economy & society • $90 B annually in sensitive devices

Current Situation • Soon it will cost less to sequence a base of DNA than to store it on hard disk • We are facing potential tsunami of genome data that will swamp our storage system and crush our compute clusters

Research in omics sciences is moving from a hypothesis-driven to a data-driven approach • Efficient analysis and interpretation of Big Data opens new avenues to explore molecular biology • New paradigms are needed to store and access data, for its annotation and integration and finally for inferring knowledge and making it available to researchers. • A clear awareness of present high performance computing (HPC) solutions in bioinformatics is the need of the day • Big Data analysis paradigms for computational biology

The Human Genome Project • First draft of the human genome in 2001, final in 2004 • Estimated cost $3 billion, time 13 years • Used Sanger sequencing • Today: Illumina: 1 week, $9,500; Exome: 6 weeks*, $1,000 • Towards the $1,000 genome Setia Pramana

Human Genome Project ➢EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. ➢In biomedicine the Human Genome Project is determining the sequences of the three billion chemical base pairs that make up human DNA.

Impact • Decoding the human genome originally took 10 years to process; now it can be achieved in a few hours • The Human Genome Project was a $3 billion project that required over a decade and was declared complete in 2003 • Now we are able to sequence and analyze an entire genome for less than a thousand dollars.

Big Data and Web 2.0 • O’Reilly Media introduced the term “Big Data” a year after Web 2.0 appeared, as many valuable Big Data situations are indeed related to consumer conduct. • Web 2.0 provided the impulse to rethink the interaction that was taking place on Internet, and to push it somewhat further.

Potential Value & Benefits of Big data Analytics

Big data analytics can ✓ accumulate the wisdom of crowds ✓ reveal patterns ✓ yield best practices

Types of tools used in Big Data • Where is processing hosted? – Distributed Servers / Cloud (e.g. Amazon EC2)

• Where is data stored? – Distributed Storage (e.g. Amazon S3)

• What is the programming model? – Distributed Processing (e.g. MapReduce)

• How is data stored & indexed? – High-performance schema-free databases (e.g. MongoDB)

• What operations are performed on data? – Analytic / Semantic Processing

Major Big Data Technologies Several technologies and frameworks have been developed to handle big data, such as: • Hadoop (or Apache Hadoop) – By far the most popular Big Data tool. It is an open source platform whose major application is to process and manage large volumes of persistently changing data. • MapReduce – The foundational framework of Hadoop. It allows massive volumes of data to be handled in a parallel, distributed processing environment. • NoSQL – Referred to as “not only SQL” databases, these are very different from traditional “relational” databases. Unlike relational databases, NoSQL does not require any specific table schema for data handling. • Grid Computing – A special type of distributed computing in which a connection is established between multiple geographically dispersed computer resources. These resources operate in parallel to handle large chunks of data. • In-memory databases – Databases which use the main memory of the system for data processing. They are used in systems where fast response times are needed and the volume of data requests is considerably high. • Specialized databases – Big databases which manage and process data to provide specific information.
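The MapReduce model named above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names and the toy reads are assumptions made for the example. It counts k-mers (substrings of length k) across sequencing reads, the genomic analogue of the classic word count.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map step: emit one (k-mer, 1) pair for every k-mer in a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k=3):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    return reduce_phase(shuffle_phase(mapped))


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads, k=3)
print(counts)
```

In a real cluster the map calls run on different nodes and the shuffle moves data across the network; the logic of each step stays the same.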

The Techniques and Methods of Analytics Learning analytics draws upon techniques from a number of established fields: – Statistics – Artificial Intelligence – Machine Learning – Data mining – Social Network Analysis – Text Mining and Web Analytics – Operational Research – Information Visualization

Benefits • Cost & management – Economies of scale, “out-sourced” resource management

• Reduced Time to deployment – Ease of assembly, works “out of the box”

• Scaling – On demand provisioning, co-locate data and compute

• Reliability – Massive, redundant, shared resources

• Sustainability – Hardware not owned

Cloud Computing • Computing in which dynamically scalable and virtualized resources are provided as a service over the Internet.

Cloud Computing • IT resources provided as a service – Compute, storage, databases, queues

• Clouds leverage economies of scale of commodity hardware – Cheap storage, high bandwidth networks & multicore processors – Geographically distributed data centers

• Offerings from Microsoft, Amazon, Google, …

wikipedia:Cloud Computing

Types of Cloud Computing • Public Cloud: Computing infrastructure is hosted at the vendor’s premises. • Private Cloud: Computing architecture is dedicated to the customer and is not shared with other organisations. • Hybrid Cloud: Organisations host some critical, secure applications in private clouds, while less critical applications are hosted in the public cloud. – Cloud bursting: the organisation uses its own infrastructure for normal usage, but the cloud is used for peak loads.

• Community Cloud
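The cloud bursting idea in the hybrid model can be illustrated with a small placement rule. This is a hypothetical sketch, not any vendor's scheduler: `place_jobs`, the job sizes and the capacity figure are all invented for the example. Jobs run locally while capacity lasts, and the overflow is sent to the public cloud.

```python
def place_jobs(job_cores, local_capacity):
    """Fill the local infrastructure first; burst the overflow to the cloud.

    job_cores: list of core counts requested by each job, in arrival order.
    local_capacity: total cores available on the organisation's own hardware.
    """
    local, cloud = [], []
    used = 0
    for cores in job_cores:
        if used + cores <= local_capacity:
            local.append(cores)
            used += cores
        else:
            # Peak load: this job does not fit locally, so it bursts out.
            cloud.append(cores)
    return local, cloud


# Hypothetical workload: five jobs against a 32-core local cluster.
local, cloud = place_jobs([8, 16, 4, 32, 8], local_capacity=32)
print("local:", local)
print("cloud:", cloud)
```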

Classification of Cloud Computing based on Service Provided • Infrastructure as a service (IaaS) – Offering hardware related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers. – Amazon EC2, Amazon S3, Rackspace Cloud Servers and Flexiscale.

• Platform as a Service (PaaS) – Offering a development platform on the cloud. – Google App Engine, Microsoft Azure, Salesforce.com’s Force.com.

• Software as a Service (SaaS) – A complete software offering on the cloud. Users can access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector. – Salesforce.com’s offering in the online Customer Relationship Management (CRM) space, Google’s Gmail, Microsoft’s Hotmail, Google Docs.

Infrastructure as a Service (IaaS)

Cloud Platforms • Amazon Elastic Compute Cloud • Google App Engine • Microsoft Azure • GoGrid • AppNexus

Bioinformatics Cloud

CloudBurst • Running time on EC2 High-Medium Instance Cluster • Comparison of CloudBurst running time while scaling the size of the cluster for mapping 7M reads to human chromosome 22 with at most four mismatches on the EC2 cluster. • The 96-core cluster is 3.5X faster than the 24-core cluster. • [CloudBurst: highly sensitive read mapping with Map…]
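The scaling figures quoted above can be turned into a parallel efficiency. The short calculation below uses only the numbers from the slide (24 cores, 96 cores, 3.5X speedup); the function name is an assumption made for the example. Quadrupling the cores would ideally give a 4X speedup, so 3.5X corresponds to 87.5% efficiency.

```python
def parallel_efficiency(speedup, cores_small, cores_large):
    """Observed speedup divided by the ideal (linear) speedup."""
    ideal_speedup = cores_large / cores_small
    return speedup / ideal_speedup


# Figures from the slide: 96 cores are 3.5X faster than 24 cores.
efficiency = parallel_efficiency(3.5, cores_small=24, cores_large=96)
print(f"{efficiency:.1%}")  # 87.5%
```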

EBI-Cloud • The EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud. • Clouds are a solution, but they also throw up fresh challenges • Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology

Bioinformatics applications • Alu sequence classification is one of the most challenging problems for sequence clustering because Alus represent the largest repeat families in the human genome. • ESTs (Expressed Sequence Tags) correspond to messenger RNAs transcribed from genes residing on chromosomes. • Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct the full-length mRNA sequence for each expressed gene.
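The idea of EST assembly described above, rebuilding a longer transcript from overlapping fragments, can be sketched with a toy greedy overlap merge. This is an illustrative simplification, not the algorithm of any particular assembler: it assumes error-free fragments from the same strand, and the sequences and function names are invented for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags


# Toy EST fragments, invented for illustration only.
ests = ["ATGGCGTAC", "GTACCTTGA", "TTGAGGCTAA"]
contigs = greedy_assemble(ests)
print(contigs)
```

Real EST data contain sequencing errors, reverse-complement reads and repeats such as Alus, which is why clustering and assembly at scale are computationally demanding.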

The distinction between big data technologies and cloud computing

• Cloud computing is often used to facilitate the cost-effective storage of such large datasets. • Big data technologies are often offered as Platform as a Service (PaaS) within a cloud environment. • These technologies, while often coinciding, are distinct and can operate independently of each other.

MapReduce • One of the first MapReduce projects applied in the biotechnology space resulted in the Genome Analysis Toolkit (GATK). • CloudBurst, developed by Michael Schatz et al., was one of the first of these. • Crossbow performs SNP identification, using Hadoop’s massive sort engine to order the alignments along the genome and then genotyping the sample using SOAPsnp. • As input it aligns a mix of 3 billion paired-end and unpaired reads, equivalent to 110 GB of compressed sequence data, and as output it catalogues all the SNPs in the genome. • Crossbow can genotype a human in approximately 3 hours on a 320-core cluster, discovering 3.7 million SNPs at >99% accuracy for $100 using AWS EC2.
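The sort-then-genotype pattern attributed to Crossbow above can be sketched on a single machine. This is a toy illustration under loud assumptions: a simple majority vote stands in for SOAPsnp's statistical model, `sorted` stands in for Hadoop's distributed sort, and the data layout, coordinates and function name are invented for the example.

```python
from collections import Counter
from itertools import groupby


def call_snps(aligned_bases, reference):
    """Order aligned bases along the genome, then genotype each position.

    aligned_bases: iterable of (chromosome, position, base) tuples.
    reference: dict mapping (chromosome, position) to the reference base.
    Returns a list of (chromosome, position, reference_base, observed_base).
    """
    def locus(record):
        return (record[0], record[1])

    # Sort step: bring all bases covering the same position together.
    ordered = sorted(aligned_bases, key=locus)

    snps = []
    for (chrom, pos), group in groupby(ordered, key=locus):
        # Genotype step (toy): take the most frequent observed base.
        counts = Counter(base for _, _, base in group)
        consensus, _ = counts.most_common(1)[0]
        ref_base = reference[(chrom, pos)]
        if consensus != ref_base:
            snps.append((chrom, pos, ref_base, consensus))
    return snps


# Hypothetical reference and aligned bases, for illustration only.
reference = {("chr22", 100): "A", ("chr22", 101): "C"}
aligned = [
    ("chr22", 101, "C"),
    ("chr22", 100, "G"),
    ("chr22", 100, "G"),
    ("chr22", 101, "C"),
    ("chr22", 100, "A"),
]
snps = call_snps(aligned, reference)
print(snps)
```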

Cloud BioLinux • Cloud BioLinux, created by the J. Craig Venter Institute (JCVI), is an example of SaaS. • It is a freely available, publicly accessible virtual machine hosted on Amazon EC2. • It offers a user-friendly Graphical User Interface (GUI), along with over 100 pre-installed bioinformatics tools including Galaxy, BioPerl, BLAST, Bioconductor, Glimmer, GeneSpring, ClustalW and EMBOSS utilities, amongst others. • While Linux-based bioinformatics distributions such as DNALinux, BioSlax, BioKnoppix and DebianMed are not unusual, they are built to run on standalone local machines. • SaaS initiatives such as BioLinux are sometimes referred to as Science as a Service (ScaaS).

SaaS VM • Another significant advantage of using such SaaS VM images on a public cloud, such as Amazon, is that Amazon provides access to several large genomic data sets, including the 1000 Genomes Project, NCBI, GenBank and Ensembl. • CloVR provides a similar image with pre-installed packages. • Standalone bio/medical software applications/suites with a cloud backend include Taverna, FX, SeqWare and BioVLab, and commercial equivalents such as DNAnexus.

Parallelised big data technologies and genomics • Genomics research is the dream use case for big data technologies which, if unified, are likely to have a profoundly positive impact on mankind

The Data Deluge – Wired 16.07

Paradigm shift: from the old compute-centric model to the new data-centric model

PROTEOMICS in Aquaculture: Applications and trends • Omics technologies (i.e., genomics, metabolomics and proteomics) have been widely implemented in the field of farm animal proteomics, with a very positive impact in areas such as aquaculture. • Proteomics, as a powerful comparative tool, has been increasingly used over the last decade to address different questions in aquaculture regarding welfare, nutrition, health, quality and safety.

• A surprising wealth of tools has already been generated for genome mapping and functional studies in many of the species used in aquaculture. • With the potential for sequencing on the horizon, the future is bright for aquaculture genomics.


Recommendations • Success in Aquaculture research dealing with the increasing amounts of omics data combined with clinical information will depend on our ability to interpret high scale data sets that are generated by emerging technologies

Recommendations • Private companies such as Microsoft, Oracle, Amazon, Google, Facebook and Twitter are masters in dealing with petabyte scale data sets. • Aquaculture will need to implement the same type of scalable structure to deal with volumes of data generated by omics technologies. • The life sciences will need to adapt to the advances in informatics to successfully address the Big Data problems that will be faced in the next decade.

Big data and privacy concerns

Privacy concern: According to some experts, the ability to protect medical and genomics data in the era of big data and a changing privacy landscape is a growing challenge.

The growing globalization of data flows via big data increases the risk that people can lose control of their own data

Celia Fernández Aller ETSISI UPM

Thank you for your patient hearing