Optimizing Big Data with Search
Mark Myers, Sr. Director of Product Marketing
Vivisimo, An IBM Company
© 2010 IBM Corporation
Topics
• Big Data opportunity
• Evolution of search
• Search, navigation and discovery for Big Data
• Deployment scenarios
• Case studies and examples
What is Big Data?
Definition varies … but you probably know it when you see it …
Data sets that are:
• Too large (volume)
• Too fast-moving (velocity)
• Too diverse (variety)
… for “conventional” data management tools
Can be structured, unstructured, or semi-structured
“Data is the New Oil” – Big Data, Big Potential
“Data is the new Oil. Data is just like crude. It’s valuable, but if unrefined it cannot really be used.” – Clive Humby, DunnHumby
“We have for the first time an economy based on a key resource [Information] that is not only renewable, but self-generating. Running out of it is not a problem, but drowning in it is.” – John Naisbitt
Big Data is one of five major trends in IT.
1. Cloud
2. Mobile
3. Social
4. Consumerization
5. Big Data
The other four trends are feeding big data.
Government is near the top in volume of data
Stored Data Per Agency (2009): 1,313 TB
Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis
Big Data applications in government
Government
• Intelligence analysis
• Law enforcement and investigation
• Cyber security / cyber warfare
• Trend forecasting
• Logistics
• Physical sciences
• … many more
Many organizations are losing ground
[Chart labels: Amount of data being produced by an organization; Percentage of data that can be exploited]
SEARCH, DISCOVERY AND NAVIGATION FOR BIG DATA
We’ve seen this before … search evolves at each stage
[Timeline figure: "Search" shown alongside each platform generation; time axis from 1960 through 1990 to 2010. Labels recovered from the figure:]
Platforms
• Mainframe, IMS and CICS
• Transaction Systems
• WebSphere
• Web, E-business and SOA
• Big Data Platform
• New Analytics
• Big Data
Search capabilities
• First full-text search systems
• Boolean full-text and proximity search
• Proprietary online databases
• Embedded search
• Natural-language processing
• E-commerce
• Filtering/alerting
• Clustering & Discovery
• Faceted Navigation
• Enterprise search
• Web search
• Mobile
• Ubiquitous Search
Search today is ubiquitous and rich with features to connect people and machines with information
• Keyword
• “Natural language” / semantics
• Clustering and tag clouds
• Faceted navigation
• Virtual documents
• Recommendation
• Filtering / alerting
• Federation
• Autocomplete
• Speech
• …
… Delivered across many devices and applications
• Portals
• Web
• Enterprise Apps
• Mobile
• Commerce
• Media
• Social
Challenges emerge when applied to an entire enterprise
[Diagram: Vivisimo Velocity connecting users to content and sources across the enterprise. Labels recovered from the figure:]
Content: Internal Safety Documents; Third-Party Research; Prior Work; Historical Abstracts; Critical Correspondence; Competitive Intelligence; Personal Documents
Sources: Database Server; SQL; Blogs; Wikis; Document Management; Email (Exchange); Cloud; Oracle; Domain; Intranet; SharePoint; Desktop; CRM; ERP; Financial Mgt. System; Federated Sources
Social Tools / Collaboration: Commenting; Tagging; Rating; Shared Folders
Users
SEARCH, DISCOVERY AND NAVIGATION WITH BIG DATA
Cost per byte increases with more refined applications
[Chart: axis labeled Cost Per Byte / Value; tiers labeled Hadoop / Big Data Framework, Data Warehouse & Analytics, and Search]
Big data adds another layer of challenge
• Disk space has increased massively, but read/write speed has not; the same is true of seek time
  – A 1 TB drive at 100 MB/s takes roughly 2.8 hours (10,000 seconds) to read in full
• More hardware means a greater chance that a single piece will fail
• Analytics need to be able to combine the data in some way, and often require preprocessing and caching
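The disk-read figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 100 MB/s = 10^8 bytes per second) and sustained sequential throughput, which gives about 2.8 hours for one drive. The `drives` parameter is a hypothetical addition that shows the idealized effect of splitting the same data evenly across drives read in parallel.

```python
TB = 10**12  # decimal terabyte, an assumption about the slide's units
MB = 10**6   # decimal megabyte


def full_scan_hours(capacity_bytes, throughput_bytes_per_s, drives=1):
    """Hours to read capacity_bytes split evenly across `drives` read in parallel."""
    seconds = capacity_bytes / (throughput_bytes_per_s * drives)
    return seconds / 3600


single = full_scan_hours(1 * TB, 100 * MB)
spread = full_scan_hours(1 * TB, 100 * MB, drives=100)

print(f"1 drive:    {single:.2f} hours")         # 2.78 hours
print(f"100 drives: {spread * 60:.1f} minutes")  # 1.7 minutes
```

The parallel figure ignores coordination overhead and hardware failures, which is exactly where the other two points on the slide come in.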
Tenets of Big Data Processing
• Distributed Processing – Ability to distribute processing across a network of nodes and re-assemble the results. Analysis takes place where the data is stored.
• Fault Tolerance – Failure of a particular node should not bring down the whole system; if a node fails and can be restored, it should be able to re-join the group activity without introducing inconsistencies.
• Linear Scalability – Adding computing resources should increase speed and performance in a linear fashion.
• Graceful Load Response – Increased load should not cause failure, but rather a graceful decline in performance.
• Elasticity of Resources – Readily expand or contract to match the workload at a given time.
A search platform needs to match these demands to function in a Big Data environment.
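A minimal sketch of the first two tenets, distributed processing and fault tolerance, applied to search. It is not how Velocity or any Big Data platform is implemented: the shard contents, node names, and the `search_shard` and `scatter_gather` functions are all hypothetical, and threads stand in for network calls. Each query runs against every node's local data, results are re-assembled, and a failed node reduces the result set instead of failing the whole query.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical in-memory shards standing in for data stored on separate nodes.
SHARDS = {
    "node-a": ["big data search", "faceted navigation"],
    "node-b": ["search federation", "entity extraction"],
    "node-c": ["clustering and discovery"],
}


def search_shard(node, term, down):
    """Run the query where the data lives; raise if the node is unavailable."""
    if node in down:
        raise ConnectionError(f"{node} unavailable")
    return [doc for doc in SHARDS[node] if term in doc]


def scatter_gather(term, down=frozenset()):
    """Fan the query out to all nodes, then merge whatever comes back."""
    hits, failed = [], []
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = {pool.submit(search_shard, n, term, down): n for n in SHARDS}
        for future in as_completed(futures):
            try:
                hits.extend(future.result())
            except ConnectionError:
                failed.append(futures[future])  # degrade, do not abort
    return sorted(hits), sorted(failed)


degraded = scatter_gather("search", down={"node-b"})
restored = scatter_gather("search")
print(degraded)
print(restored)
```

With `node-b` down, the query still returns the hit from `node-a` and reports the failed node; once `node-b` is back, the same call picks up its documents again without any other change.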
Search Platform Architecture
Applications
Application Framework
• User Profiles
• Federated Sources: Subscriptions, Feeds, Web Results, Other Apps
• Authentication / Authorization
• Query Transformation
• Personalization
• Display
Search Engine
• Text Analytics: Thesaurus, Clustering, Ontology Support, Semantic Processing, Entity Extraction, Relevancy
• Meta Data: Faceting, Tagging, Taxonomy, Collaboration
• Indexing, Converting, Crawling
Connector Framework
• CM, RM, DM
• RDBMS
• Feeds
• Web 2.0
• Email
• Web
• CRM, ERP
• File Systems
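To make the layers concrete, here is a toy sketch of the path a document takes through such an architecture: records arrive as a connector might deliver them, a converting step normalizes the text, an indexing step builds an inverted index, and a query returns matches together with facet counts by source. Everything here is an illustrative assumption (the sample records, the function names, whitespace tokenization, AND-only queries) and not a description of the Velocity platform.

```python
from collections import Counter, defaultdict

# Hypothetical records, shaped as a connector might deliver them.
RECORDS = [
    {"id": 1, "source": "Email", "text": "Quarterly safety report attached"},
    {"id": 2, "source": "File Systems", "text": "Safety inspection checklist"},
    {"id": 3, "source": "CRM, ERP", "text": "Customer report for Q3"},
]


def convert(record):
    """Converting step: reduce a record to lowercase whitespace tokens."""
    return record["text"].lower().split()


def build_index(records):
    """Indexing step: map each token to the set of record ids containing it."""
    index = defaultdict(set)
    for record in records:
        for token in convert(record):
            index[token].add(record["id"])
    return index


def search(index, records, query):
    """AND query: return matching ids plus facet counts by source."""
    terms = query.lower().split()
    if not terms:
        return [], {}
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    by_id = {r["id"]: r for r in records}
    facets = Counter(by_id[i]["source"] for i in ids)
    return sorted(ids), dict(facets)


index = build_index(RECORDS)
print(search(index, RECORDS, "safety"))
print(search(index, RECORDS, "safety report"))
```

The facet counts are what a faceted-navigation display would use to let a user narrow the "safety" results to a single source such as Email.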
Deployment/Integration Scenarios for Exploiting Big Data
Velocity Platform
1. Rapid search, discovery and navigation
2. Load data from enterprise applications into Big Data framework
3. Index and search of Big Data analytics
4. Leveraging Big Data Platform for bulk processing and analytics
Rapid search, discovery and navigation
[Diagram: the search platform architecture (applications, application framework, search engine, connector framework) alongside Vivisimo Big Data Connectors and data nodes in the IBM Big Data Platform]
• Rapid, near real-time access
• Immediate answers
• Pinpoint results
• Data fusion
Load data from enterprise applications into Big Data framework
[Diagram: the search platform architecture (applications, application framework, search engine, connector framework) alongside Vivisimo Big Data Connectors and IBM Big Data Platform nodes, each labeled Analytics & Conversion, Data, and Meta Data]
• Delivers enterprise content into big data framework for analytics and fusion
Index and search of Big Data analytics
[Diagram: the search platform architecture (applications, application framework, search engine, connector framework) alongside Vivisimo Big Data Connectors and IBM Big Data Platform nodes, each labeled Analytics & Conversion and Data]
• Ensures ability to access and use products of big data analytics in the future
• Fusion of big data analytics with enterprise data
Leveraging Big Data Platform for bulk processing and analytics
[Diagram: the search platform architecture (applications, application framework, search engine, connector framework) alongside Vivisimo Big Data Connectors and IBM Big Data Platform nodes, each labeled Analytics & Conversion, Data, and Meta Data; this variant also lists BI with Faceting, Tagging, Taxonomy and Collaboration]
• Leverage the framework for text analytics and metadata extraction
• Bulk processing of enormous volumes
• Fusion
EXAMPLES AND CASE STUDIES
Federation across secure domains at massive scale
Knowledge fusion and collaboration across more than 400,000 users
Powerful social search to drive collaboration and knowledge-sharing
Metadata Catalog
Fusion of enterprise data and analytics – commercial
Continued
360 degree view of the customer
360 degree view of an asset
360 degree view of the citizen (conceptual prototype)
Search across multiple silos
National Archives and Records Administration – Electronic Records Archives
Challenge: create a single access point and rich discovery environment for the permanent records of the United States
Online Public Access prototype
– Streamlined searching
– Better results
– Better presentation
National Archives and Records Administration
Projected Data Growth for Electronic Records Archives
QUESTIONS & DISCUSSION