Cloud-based Services

BIG DATA ANALYTICS APPROACH TO UNEARTH KNOWLEDGE FROM LARGE SCALE GENOMIC DATA (Abstract No. GI(O)-4)

Ajit Kumar Roy
• Ex. National Consultant (IA) for East and North East Region, NAIP, ICAR
• Ex. Consultant (Statistics), College of Fisheries, CAU, Agartala
• Ex. Principal Scientist & Head, Social Science, & Co-coordinator, Bioinformatics Centre of CIFA (ICAR), Bhubaneswar
• Ex. Computer Specialist, SAARC Agricultural Information Centre, Dhaka, Bangladesh

Presented at the 2nd International Symposium on Genomics in Aquaculture, January 28-30, 2016, held at ICAR-CIFA, Bhubaneswar, India

Contents
➢ Introduction & basics of big data
➢ Cloud computing
➢ Potential value & benefits of big data analytics
➢ Advances in bioinformatics / computational biology
➢ Growth of genomic data
➢ Big data tools
➢ Applied computational genomics
➢ Big data and privacy concerns

A New Era of Analytics

The Data Deluge – Wired 16.07

The Fourth Paradigm: Data-Intensive Scientific Discovery, by Tony Hey, Stewart Tansley, and Kristin Tolle

What is Big Data?

“The collection, analysis and generation of insights from a wide variety of data sources in order to be able to improve performance”

Myths About Big Data

• Big Data is new
• Big Data is only about massive data volume
• Big Data means Hadoop
• Big Data needs a data warehouse
• Big Data means unstructured data
• Big Data is for social media & sentiment analysis

BIG DATA
• Big data refers to very large datasets with complex structures that are difficult to process using traditional methods and tools.
• "Process" here covers capture, storage, formatting, extraction, curation, integration, analysis, and visualization.
• Big data and analytics are intertwined.

Born with Big Data
• Google, eBay, LinkedIn, and Facebook were built around Big Data from the beginning.
• For them, Big Data could stand alone, Big Data analytics could be the only focus of analytics, and Big Data technology architectures could be the only architecture.
• There was no need to merge Big Data technologies with a traditional IT infrastructure.

What is BIG DATA?
• Decoding the human genome originally took 10 years; now it can be achieved in one week.
• Wal-Mart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.

Big Data Is..

It is all about better analytics on a broader spectrum of data, and therefore represents an opportunity to create even more differentiation among industry peers.

Four Characteristics of Big Data
• Volume: cost-efficiently processing the growing volume of data, projected to grow roughly 50x between 2010 and 2020, to 35 ZB.
• Velocity: responding to the increasing velocity of data (30 billion RFID sensors and counting).
• Variety: collectively analyzing the broadening variety of data (80% of the world's data is unstructured).
• Veracity: establishing the veracity of big data sources.

Where Is This "Big Data" Coming From?
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day from other sources
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 30 billion RFID tags today (1.3 billion in 2005)
• 76 million smart meters in 2009, 200 million by 2014
• 2+ billion people on the Web by end of 2011

With Big Data, We've Moved into a New Era of Analytics
• Volume: 12+ terabytes of Tweets created daily.
• Velocity: 5+ million trade events per second.
• Variety: 100s of different types of data.
• Veracity: only 1 in 3 decision makers trust their information.

The number of organizations that see analytics as a competitive advantage is growing (63%), as analytics shifted from being a business initiative in 2010 to a business imperative by 2012.

Studies show that organizations competing on analytics substantially outperform their peers:
• 1.6x revenue growth
• 2.0x EBITDA growth
Source: IBM IBV / MIT Sloan Management Review Study 2011. Copyright Massachusetts Institute of Technology 2011.

Volume (Scale)
• Data volume: a 44x increase from 2009 to 2020, from 0.8 ZB to 35 ZB (35 / 0.8 ≈ 44).
• Collected and generated data volume is increasing exponentially.

Variety (Complexity)
• Relational data (tables / transactions / legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data: social networks, Semantic Web (RDF), …
• Streaming data: you can only scan the data once
• A single application can generate and collect many types of data
• Big public data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.

Velocity (Speed)
• Data is being generated fast and needs to be processed fast.
• Online data analytics: late decisions → missed opportunities.
• Example: healthcare monitoring, where sensors monitoring your activities and body mean any abnormal measurement requires an immediate reaction (a minimal one-pass sketch follows).
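Because a stream can only be scanned once, any statistic must be maintained incrementally as the data arrives. Below is a minimal one-pass sketch in Python; the readings, the generator, and the alert threshold are all illustrative assumptions, not part of any particular monitoring product.

```python
# One-pass ("scan once") stream processing: maintain a running mean and
# react immediately to an abnormal reading. All values are illustrative.

def sensor_stream():
    # Stand-in for a live feed; in practice this would read from a socket or queue.
    for reading in [72, 75, 71, 74, 140, 73]:  # heart-rate-like values
        yield reading

count, total = 0, 0.0
for value in sensor_stream():
    count += 1
    total += value
    mean = total / count
    # Decide now -- the stream cannot be rescanned later.
    if abs(value - mean) > 30:  # illustrative alert threshold
        print(f"ALERT: abnormal reading {value} (running mean {mean:.1f})")
```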

Real-Time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner.

Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (data warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture & technology)

The Model Has Changed…
• Old model: few companies generate data, all others consume data.
• New model: all of us generate data, and all of us consume data.

Big Data Analytics
• Big data is more real-time in nature than traditional DW applications.
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps.
• Shared-nothing, massively parallel, scale-out architectures are well-suited for big data apps.

Big Data Technology


Cloud Computing • IT resources provided as a service – Compute, storage, databases, queues

• Clouds leverage economies of scale of commodity hardware – Cheap storage, high bandwidth networks & multicore processors – Geographically distributed data centers

• Offerings from Microsoft, Amazon, Google, …

Source: Wikipedia, "Cloud Computing"
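To make "storage as a service" concrete, here is a minimal sketch using Amazon S3 through the boto3 library (an assumed dependency); the bucket name and file paths are placeholders, and valid AWS credentials are assumed to be configured.

```python
# Minimal "storage as a service" sketch: push a file into cloud storage
# and pull it back. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload a local file to the (hypothetical) bucket...
s3.upload_file("results.csv", "my-example-bucket", "experiments/results.csv")

# ...and retrieve it later from anywhere with access to that bucket.
s3.download_file("my-example-bucket", "experiments/results.csv", "results_copy.csv")
```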

Benefits • Cost & management – Economies of scale, “out-sourced” resource management

• Reduced Time to deployment – Ease of assembly, works “out of the box”

• Scaling – On demand provisioning, co-locate data and compute

• Reliability – Massive, redundant, shared resources

• Sustainability – Hardware not owned

Types of Cloud Computing • Public Cloud: Computing infrastructure is hosted at the vendor’s premises. • Private Cloud: Computing architecture is dedicated to the customer and is not shared with other organisations. • Hybrid Cloud: Organisations host some critical, secure applications in private clouds. The not so critical applications are hosted in the public cloud – Cloud bursting: the organisation uses its own infrastructure for normal usage, but cloud is used for peak loads.

• Community Cloud

Classification of Cloud Computing Based on Service Provided
• Infrastructure as a Service (IaaS)
– Offering hardware-related services using the principles of cloud computing, such as storage services (database or disk storage) or virtual servers.
– Examples: Amazon EC2, Amazon S3, Rackspace Cloud Servers and FlexiScale.
• Platform as a Service (PaaS)
– Offering a development platform on the cloud.
– Examples: Google's App Engine, Microsoft's Azure, Salesforce.com's Force.com.
• Software as a Service (SaaS)
– A complete software offering on the cloud: users access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector.
– Examples: Salesforce.com's offering in the online Customer Relationship Management (CRM) space, Google's Gmail and Google Docs, Microsoft's Hotmail.

Infrastructure as a Service (IaaS)

More Refined Categorization
• Storage-as-a-service
• Database-as-a-service
• Information-as-a-service
• Process-as-a-service
• Application-as-a-service
• Platform-as-a-service
• Integration-as-a-service
• Security-as-a-service
• Management/Governance-as-a-service
• Testing-as-a-service
• Infrastructure-as-a-service
Source: InfoWorld Cloud Computing Deep Dive

Key Ingredients in Cloud Computing
• Service-Oriented Architecture (SOA)
• Utility computing (on demand)
• Virtualization (P2P network)
• SaaS (Software as a Service)
• PaaS (Platform as a Service)
• IaaS (Infrastructure as a Service)
• Web services in the cloud

Cloud versus Cloud
• Amazon Elastic Compute Cloud
• Google App Engine
• Microsoft Azure
• GoGrid
• AppNexus

Sources of Big Data
• Social media, web analytics, log files, sensors, and the like all provide valuable information, while the cost of IT solutions continues to drop and computer processing power keeps increasing.
• IDC expects more than 1 billion sensors to be connected to the Internet. All the accompanying data flows can supply interesting insights that aid better business decisions.

• The Internet of Things is also becoming an increasingly rich source of data. Presently, there are 13 billion devices connected to the Internet, and there will be 50 billion by 2020, says Cisco CTO Padmasree Warrior.

Why Big Data
• The growth of Big Data is driven by:
➢ Increasing storage capacities
➢ Increasing processing power
➢ Availability of data (of different types)

➢ Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone

How Is Big Data Different?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the Internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value, so one needs to focus on the important part

Who is collecting all these data?
• Hospitals & other medical systems
• Banking & phone systems
• Pharmacies, laboratories, imaging centers
• Emergency Medical Services (EMS)
• Hospital Information Systems
• Doc-in-a-box clinics
• Electronic Medical Records
• Blood banks
• Birth & death records
• Government agencies
• Big pharmaceutical companies

Estimates of Big Data volume
• In 2020 there will be 35 zettabytes of digital data. That represents a stack of DVDs reaching halfway from the Earth to Mars.
• Facebook has 70 petabytes and 2,700 multiprocessor nodes.
• The Bing search engine has 150 petabytes and 40,000 nodes.
• The simplest definition comes from Forrester Research: Big Data is techniques and technologies that make handling data at extreme scale economical.

Why do we care about Big Data?
• Data is the new oil; we have to learn how to mine it! (Qatar; European Commission report)
• $7 trillion of economic value in 7 US sectors alone
• McKinsey's new 4th factor of production: land, labor, capital + data
• Gartner: hundreds of billions of GDP by 2020
• Big Data: a new driver for the digital economy & society
• An insurance firm with 5 terabytes of data on share drives pays $1.5M per year
• $90B annually in sensitive devices

Application of Big Data Analytics
• Smarter healthcare
• Homeland security
• Traffic control
• Manufacturing
• Multi-channel sales
• Telecom
• Trading analytics
• Search quality

Studies show that organizations competing on analytics substantially outperform their peers:
• 1.6x revenue growth
• 2.5x stock price appreciation
• 2.0x EBITDA growth
Source: IBM IBV / MIT Sloan Management Review Study 2011. Copyright Massachusetts Institute of Technology 2011.

Types of Tools Used in Big Data
• Where is processing hosted? – Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored? – Distributed storage (e.g. Amazon S3)
• What is the programming model? – Distributed processing (e.g. MapReduce)
• How is data stored & indexed? – High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on data? – Analytic / semantic processing

The Techniques and Methods of Analytics
Learning analytics draws upon techniques from a number of established fields:
– Statistics
– Artificial Intelligence
– Machine Learning
– Data Mining
– Social Network Analysis
– Text Mining and Web Analytics
– Operational Research
– Information Visualization

McKinsey Global Institute in 2011 provided a list of the top 10 common techniques applicable across a range of industries.

Techniques
• Testing
• Cluster analysis
• Classification
• Data mining
• Network analysis
• Predictive modelling
• Sentiment analysis
• Statistics
• Visualization
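As an illustration of one of these techniques, here is a minimal cluster-analysis sketch using scikit-learn (an assumed dependency); the toy 2-D feature matrix stands in for any real dataset, such as gene-expression profiles.

```python
# Minimal k-means cluster analysis on a toy feature matrix.
import numpy as np
from sklearn.cluster import KMeans

# Four samples with two features each; two obvious groups.
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [8.2, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each sample, e.g. [0 0 1 1]
print(km.cluster_centers_)  # coordinates of the two centroids
```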

IMPACT OF BIG DATA IN LIFE SCIENCES AND HEALTH CARE
Big data can also be seen in the life sciences, where big sets of data such as
✓ genome sequencing,
✓ clinical data and
✓ patient data
are analyzed and used to advance breakthroughs in science and research.

Human Genome Project
➢ In biomedicine, the Human Genome Project determined the sequences of the three billion chemical base pairs that make up human DNA.
➢ The EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud.

Genome biology and the study of cell circuits: among the most important Big Data challenges of our time
• A fundamental Big Data development in the field of healthcare is the ambition of the Broad Institute, an initiative of MIT and Harvard, to expand on the Human Genome Project, which was completed in 2003.
• The Broad Institute is currently researching the cell mutations that cause cancer, the molecular structure of the viruses and bacteria responsible for infectious illnesses, and the possibilities of their use in the development of medicines.

Digital healthcare data expected to reach 25,000 petabytes in 2020
• Standard medical practice is moving from relatively ad hoc and subjective decision making to evidence-based healthcare.
• Healthcare data is becoming more complex: in 2012, worldwide digital healthcare data was estimated to be equal to 500 petabytes.

Global Viral Forecasting Initiative (GVFI)
• The San Francisco-based Global Viral Forecasting Initiative (GVFI) uses advanced data analysis on information mined from the Internet to comprehensively identify the locations, sources and drivers of local outbreaks before they become global epidemics, says GVFI's Chief Innovation Officer, Lucky Gunasekara.
• Public health offers one of the most compelling areas where the analysis of mobile and Internet data could lead to huge public gains.

Big Data is developing most rapidly in the world of Big Science
In 10 years, the 2,800 radio telescopes of the Square Kilometre Array (SKA), the largest Big Science project ever, will generate:
➢ 1 billion gigabytes of data daily,
➢ equal to the traffic of the entire Internet on a weekday in 2012.
➢ For comparison, in 2012 Google processed a total of 5 petabytes (5,000 terabytes) per hour.

Large Hadron Collider
• Another example of Big Data is the Large Hadron Collider at the European Organisation for Nuclear Research (CERN), which has 150 million sensors and created 22 petabytes of data in 2012.


CERN's Large Hadron Collider (LHC) generates 15 PB a year.

The EarthScope
• The EarthScope is the world's largest science project.
• Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

International Virtual Observatory Alliance
A message of The Fourth Paradigm is that the volume of data available to us, now often termed Big Data, is so large that it presents many new opportunities for analysis while also requiring new modes of thinking, for example in the International Virtual Observatory Alliance (http://virtualobservatory.org) and in citizen science.

The Virtual Astronomical Observatory (VAO)
• The Virtual Observatory (VO) is an international effort to bring large-scale electronic integration of astronomy data, tools, and services to the global community.
• These tools will make use of data collected by the most powerful telescopes in the world.

Square Kilometre Array (SKA) to produce one exabyte of data per day

One clear example of Big Data is the Square Kilometre Array (SKA), planned for construction in South Africa and Australia. When the SKA is completed in 2024, it will produce in excess of one exabyte of raw data per day, which is more than the entire daily Internet traffic at present. (www.skatelescope.org)

SKA The SKA is a 1.5 billion Euro project that will have more than 3000 receiving dishes to produce a combined information collecting area of one square kilometre, and will use enough optical fibre to wrap twice around the Earth.

Huge Data from Satellites in orbit In Earth observation there are over 200 satellites in orbit continuously collecting data about the atmosphere and the land, ocean and ice surfaces of planet Earth with pixel sizes ranging from 50 cm to many tens of kilometres.

World Data System (WDS)
Approved members of the World Data System so far include centres for:
I. Antarctic data (Hobart)
II. climate data (Hamburg)
III. ocean data (Washington DC)
IV. environment data (Beijing)
V. solid Earth physics data (Moscow)
VI. plus the International Laser Ranging Service and the International Very Long Baseline Interferometry Service
It is anticipated that the WDS will comprise over 100 centres and networks of active, professional data management.

The objectives of the World Data System (WDS)
• Enable universal and equitable access to quality-assured scientific data, data services, products and information;
• Ensure long-term data stewardship;
• Foster compliance with agreed-upon data standards and conventions;
• Provide mechanisms to facilitate and improve access to data and data products.

Improvements in professional data management will result in better science
Big Data presents science with many challenges, but at the same time many opportunities to influence how science grows and develops for the better.


Big Data and Web 2.0
• O'Reilly Media introduced the term "Big Data" a year after Web 2.0 appeared, as many valuable Big Data situations are indeed related to consumer conduct.
• Web 2.0 provided the impulse to rethink the interaction taking place on the Internet, and to push it somewhat further.
• In much the same way, the qualification "Big Data" calls attention to the business possibilities of the flood of data on the one hand, and to the new technologies, techniques and methods directed toward them, on the other.

Potential Value & Benefits of Big Data Analytics
Big data analytics can:
✓ accumulate the wisdom of crowds
✓ reveal patterns
✓ yield best practices

Benefits of Big Data
• Real-time big data isn't just a process for storing petabytes or exabytes of data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time.
• Fast-forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
• Technologies such as MapReduce, Hive and Impala enable you to run queries without changing the data structures underneath.

Benefits of Big Data
• Research finds that organizations are using big data to target customer-centric outcomes, tap into internal data and build a better information ecosystem.
• Big Data is already an important part of the $64 billion database and data analytics market.
• It offers commercial opportunities of a scale comparable to:
➢ enterprise software in the late 1980s,
➢ the Internet boom of the 1990s, and
➢ the social media explosion of today.

Big data and privacy concerns

The growing globalization of data flows via big data increases the risk that people can lose control of their own data.

– Celia Fernández Aller, ETSISI, UPM

➢ “Big data technologies, together with the sensors that ride on the “Internet of Things,” pierce many spaces that were previously private”. (White House Report, 2014)

Market Forecast
• Wikibon projects that the Big Data market will top $84B in 2026, attaining a 17% compound annual growth rate (CAGR) over the forecast period 2011-2026.
• The Big Data market reached $27.36B in 2014, up from $19.6B in 2013.

Cloud Computing

EBI Cloud
• The EBI is building a cloud-based infrastructure called Helix Nebula — The Science Cloud.
• Clouds are a solution, but they also throw up fresh challenges.
• Harnessing powerful computers and numerous tools for data analysis is crucial in drug discovery and other areas of big-data biology.

Non-Relational (NoSQL) Databases
• Relational databases have been a computing mainstay since the 1970s. Data is stored in rows and columns and can be accessed through SQL queries.
• By way of contrast, non-relational (i.e., NoSQL) databases are relatively new (1998), can store data of any structure, and do not rely on SQL to retrieve data.
• Data such as XML, text, audio, video, image, and application-specific document files are often stored and retrieved "as is".
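A minimal sketch of schema-free storage using MongoDB through the pymongo driver (an assumed dependency, with a local mongod instance assumed to be running); the database, collection, and document fields are hypothetical, chosen to show that two documents in one collection need not share a schema.

```python
# Store and query schema-free documents in MongoDB -- no table definition needed.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
coll = client["biodb"]["samples"]  # hypothetical database and collection

# Two documents with different fields can live in the same collection.
coll.insert_one({"sample": "S1", "organism": "Labeo rohita", "reads": 1_200_000})
coll.insert_one({"sample": "S2", "tissue": "liver", "notes": "pilot run"})

# Retrieve documents with a query document rather than SQL.
for doc in coll.find({"sample": "S1"}):
    print(doc)
```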

Hadoop/MapReduce
• Of all the platforms and approaches to storing and analyzing big data, none is receiving more attention than Hadoop/MapReduce.
• Its origins trace back to the early 2000s, when companies such as Google, Yahoo, and Facebook needed the ability to store and analyze massive amounts of data from the Internet.

Apache Hadoop
• Apache Hadoop is a software framework for processing large amounts of data across potentially massively parallel clusters of servers. To illustrate, Yahoo has over 42,000 servers in its Hadoop installation.
• The Hadoop infrastructure typically runs MapReduce programs (written in a programming or scripting language such as Java, Python, C, R, or Perl) in parallel.
• MapReduce takes large datasets, extracts and transforms useful data, distributes the data to the various servers where processing occurs, and assembles the results into a smaller, easier-to-analyze file.
• It does not perform analytics per se; rather, it provides the framework that controls the programs that perform the analytics.
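The classic illustration of the MapReduce model is a word count. Below is a minimal sketch using the mrjob library (an assumed dependency); the same class runs locally for testing and, unchanged, on a Hadoop cluster or Amazon EMR.

```python
# Word count in the MapReduce model: mappers emit (word, 1) pairs, the
# framework shuffles and groups them by key, and reducers sum the counts.
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Map step: one (key, value) pair per word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Reduce step: counts is an iterator over all values for this word.
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()
```

Run locally with "python wordcount.py input.txt", or against a cluster by adding a runner flag such as "-r hadoop".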

Bioinformatics applications
• Alu sequence classification is one of the most challenging problems in sequence clustering, because Alus represent the largest repeat families in the human genome.
• An EST (Expressed Sequence Tag) corresponds to messenger RNA transcribed from genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct the full-length mRNA sequence for each expressed gene.
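Workloads like these start with scanning millions of sequence records. As a minimal sketch of that first step, the following uses Biopython (an assumed dependency, version 1.80+ for gc_fraction); "ests.fasta" is a placeholder file of EST reads. SeqIO streams records one at a time, so files far larger than memory can be processed.

```python
# Stream a (potentially huge) FASTA file and report basic per-read statistics.
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

for record in SeqIO.parse("ests.fasta", "fasta"):  # placeholder input file
    # record.seq is the sequence; length and GC content are cheap summaries.
    print(record.id, len(record.seq), f"GC={gc_fraction(record.seq):.2%}")
```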

The distinction between big data technologies and cloud computing

• Cloud computing is often used to facilitate the cost-effective storage of such large datasets.
• Big data technologies are often offered as Platform as a Service (PaaS) within a cloud environment.
• These technologies, while often coinciding, are distinct and can operate mutually exclusively.

MapReduce
• One of the first MapReduce projects applied in the biotechnology space resulted in the Genome Analysis Tool Kit (GATK).
• CloudBurst was one of the first of these, developed by Michael Schatz et al.
• Crossbow performs SNP identification, using Hadoop's massive sort engine to order the alignments along the genome and then genotyping each sample using SoapSNP.
• As input it aligns a mix of 3 billion paired-end and unpaired reads, equivalent to 110 GB of compressed sequence data; as output it catalogues all the SNPs in the genome.
• Crossbow can genotype a human in approximately 3 hours on a 320-core cluster, discovering 3.7 million SNPs at >99% accuracy for $100 (including an hour of data transfer) using AWS EC2.

• Unfortunately, most bioinformatics applications are difficult to build, configure and maintain, primarily because they are, for the most part, open source in nature, lack good documentation and require many programming library dependencies.
• As a result, an advanced level of technical expertise is required of the biologist, which is a common bottleneck in the adoption of bioinformatics applications.
• However, because all software applications can be installed and configured within a VM, SaaS provides an ideal solution.

Cloud BioLinux
• Cloud BioLinux, created by the J. Craig Venter Institute (JCVI), is an example of SaaS.
• It is a publicly accessible virtual machine, stored at Amazon EC2 and freely available to EC2 users.
• It provides a user-friendly Graphical User Interface (GUI), along with over 100 pre-installed bioinformatics tools including Galaxy, BioPerl, BLAST, Bioconductor, Glimmer, GeneSpring, ClustalW and the EMBOSS utilities, amongst others.
• While Linux-based bioinformatics distributions such as DNALinux, BioSlax, BioKnoppix and Debian Med are not unusual, they are built to run on standalone local machines.
• SaaS initiatives such as Cloud BioLinux have come to be referred to as Science as a Service (ScaaS).

SaaS VM
• Another significant advantage of using such SaaS VM images on a public cloud, such as Amazon, is that Amazon provides access to several large genomic data sets, including the 1000 Genomes project, NCBI GenBank and Ensembl.
• CloVR provides a similar image with pre-installed packages.
• Standalone bio/medical software applications and suites with a cloud backend include Taverna, FX, SeqWare and BioVLab, plus commercial equivalents such as DNAnexus.

Parallelised big data technologies and genomics

Genomics research is the dream use case for big data technologies, which, if unified, are likely to have a profoundly positive impact on mankind.

Privacy Concerns
• According to some experts, the ability to protect medical and genomics data in the era of big data and a changing privacy landscape is a growing challenge.
• While cloud computing is championed as a method for handling such big data sets, its perceived insecurity is viewed by many as a key inhibitor to its widespread adoption in the commercial life sciences sector.

• One of the greatest scientific and engineering challenges of the 21st century will be to understand and make effective use of this growing body of information.
• Visual data analysis, facilitated by interactive interfaces, enables the detection and validation of expected results, while also enabling unexpected discoveries in science.

Data processing and resource management
• MapReduce is one of the most popular programming models for processing large amounts of data on clusters of computers.
• Hadoop is the most widely used open-source MapReduce implementation, and is also made available by several cloud providers.
• Amazon EMR enables customers to instantiate Hadoop clusters to process large amounts of data using the Amazon Elastic Compute Cloud (EC2) and other Amazon Web Services for data storage and transfer.
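As a hedged sketch of that last point, the following launches a transient Hadoop cluster on Amazon EMR with boto3 (an assumed dependency, with AWS credentials configured); the job name, release label, instance types and counts, and IAM role names are illustrative assumptions.

```python
# Launch a short-lived Hadoop cluster on Amazon EMR, then print its ID.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # illustrative region

response = emr.run_job_flow(
    Name="genomics-batch",                 # hypothetical job name
    ReleaseLabel="emr-6.15.0",             # an example EMR release
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",  # illustrative instance types
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the work finishes
    },
    JobFlowRole="EMR_EC2_DefaultRole",     # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```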

Bioinformatics

Bioinformatics topics are all linked together

In the post-genomic era, a new language has been created for the new biology:

• Genomics
• Functional Genomics
• Proteomics
• cDNA microarrays
• Global Gene Expression Patterns

June 26, 2000, at the White House

New Computational Tools Are Needed
• Sequencing
• Analyzing experimental data
• Representing vast quantities of information
• Searching
• Pattern matching
• Data mining
• Gene discovery
• Function discovery

Goals of Bioinformatics
• Classify
• Identify patterns / pattern recognition
• Make predictions
• Data modelling: creation of models & prediction
• Assessment and comparison
• Optimization
• Better utilize existing knowledge

High-Throughput Techniques
• High-throughput protein crystallization
• Massively parallel sequencing
• Mass spectrometry
• Microarrays
• High-throughput cell imaging
• High-throughput in vivo screening
• …

How to extract the information? Computational tools:
• Building the databases
• Performing analysis / extracting features
• Data mining
• Classification / statistical learning
• Visualization / representation
→ Biological information!

Post-genomic era: high-throughput DNA sequencing centres
• "Sequencing factories" (e.g. the Whitehead Institute) use custom-designed, factory-style conveyor-belt robots that perform all functions, from purifying DNA from bacterial cultures through setting up and purifying sequencing reactions.
• Automated sequencing, DNA isolation, mass spectrometry, liquid-handling robots (e.g. the Beckman Biomek FX), and Affymetrix gene-expression arrays.

Role of Bioinformatics
[Diagram: the omics cascade from DNA (genomics) to RNA (transcriptomics), protein (proteomics), metabolites (metabolomics), and the cell (integrative/systems biology), with bioinformatics supplying the methodology across data generation/validation, data integration/fusion, and data usage/user interfacing.]

What is Genomics?
• Genome: the complete set of genetic instructions for making an organism.
• Genomics: any attempt to analyze or compare the entire genetic complement of a species. Early genomics was mostly recording genome sequences.

Genomics
• Classic genomics
• Post-genomic era: comparative genomics, functional genomics, structural genomics

Data Generation in Genomics
• The astonishing rate of data generation by low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry.
• Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that these technologies generate, which in turn requires us to adopt advances in informatics.

Units of scale: terabytes (10^12), petabytes (10^15), exabytes (10^18), zettabytes (10^21), yottabytes (10^24).
[Chart: the growing capability gap between data volumes and storage. One theater's data stream (2006) was ~270 TB of NTM data per year against an example storage capacity of 250 TB (Large Data JCTD, 12 TB in 2006), set against GIG data capacity (services, transport & storage) from 2000 through 2015 and beyond.]
Source: Bob Gourley, http://ctovision.typepad.com/InfoSharingTechnologyFutures.ppt

Current Situation
• Soon it will cost less to sequence a base of DNA than to store it on a hard disk (a back-of-envelope sketch follows).
• We are facing a potential tsunami of genome data that will swamp our storage systems and crush our compute clusters.
• Who is affected:
– Genome sequence repositories, which need to maintain and update their data in a timely fashion
– Power users accustomed to downloading the data to a local computer for analysis
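To see why storage becomes the bottleneck, a back-of-envelope estimate helps; the 30x coverage figure below is an illustrative assumption for whole-genome resequencing, not a statistic quoted in this talk.

```python
# Rough storage footprint of the raw base calls for one resequenced human genome.
bases = 3_000_000_000   # ~3 billion base pairs in the haploid human genome
coverage = 30           # assumed sequencing depth (each base read ~30 times)
bytes_per_base = 1      # one ASCII character per base call, uncompressed

raw_bytes = bases * coverage * bytes_per_base
print(f"{raw_bytes / 1e9:.0f} GB of raw bases per genome")  # ~90 GB, before
# quality scores and metadata, which roughly double the size again in practice.
```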

The research in omics sciences is moving from a hypothesis-driven to a data-driven approach
• Efficient analysis and interpretation of Big Data opens new avenues to explore molecular biology.
• New paradigms are needed to store and access data, to annotate and integrate it, and finally to infer knowledge and make it available to researchers.
• Bioinformatics can be viewed as the "glue" for all these processes.
• A clear awareness of present high-performance computing (HPC) solutions in bioinformatics is the need of the day.
• So are Big Data analysis paradigms for computational biology, and attention to the issues that are still open in the biological sciences.

• With the introduction of high-throughput sequencing platforms, it is becoming feasible to consider sequencing approaches for many research projects.
• Knowing how to manage and interpret the large volume of sequence data resulting from such technologies is less clear.

Computational solutions to large-scale data management and analysis
• High-throughput sequencing,
• large-scale data generation projects, and
• web-based cloud computing
are changing how computational biology is performed.

Modeling genomic regulatory networks with big data
– Current models of the regulation of gene expression are simplistic and need updating.
– Integrative analysis of large datasets is now a fundamental part of molecular biology.
– Sophisticated computational tools are becoming available, requiring ongoing education.
