Building Scalable Big Data Infrastructure Using Open Source ...
Recommend Documents
Keywords: Big data, open source software, data science, programming, IT software ..... support systems: Putting analytics and big data in cloud.Decision Support ...
Dec 12, 2017 - Our proposed data processing pipeline is based on. Apache Kafka for data ingestion, Apache Spark for in-memory ... messages are broadcasted to all subscribed consumers. Rab- ... ment analysis application processes each tweet and classi
3 Cf. par exemple: The Great Disk Drive in the Sky: How Web giants store bigâand we .... semblables, et les mêmes out
have been developed and published as open-source software. ... Fortunately, there are many articles and surveys that evaluate or ..... Searching capabilities provided for defined sections in a document (title, chapter, paragraph) or in the whole ...
Official Full-Text Paper (PDF): Building Reconfigurable Systems Using Open Source Components. ... http://cdn.opencores.org/downloads/. wbspec_b4.pdf, 2010 ...
Presenter information. Tomas Kirnak. Network design. Security, wireless. Servers
. Virtualization. MikroTik Certified Trainer. Atris, Slovakia. Established 1991.
Jan 1, 2014 - FIGURE 4. Use of Specialized BDA Information Management Software ... big data requirements into architectu
between HEDIIP and Data Futuresâbut retained its independence through oversight by a separate ... The Data Futures programme itself commenced in 2016, with a four-year timeline for ...... Austin, Texas: The New Media Consortium. Bartlett ...
Apr 2, 2018 - and aggregate data from the data sources, which are then stored in a ... Several data sources are needed to get an overview over the community-driven data ...... F, Purves R, editors. European Handbook of Crowdsourced.
Keywords: Machine Learning, Big Data, Hadoop, MapReduce, GPU. 1 Introduction. Machine ..... Hadoop neural network for parallel and distributed feature ...
To capitalize on the Big Data trend, a new breed of Big Data technologies (such ... harnessing the emergent Big Data to
Nov 22, 2013 - large-scale data by taking advantages from both MapReduce (in ... practicality of HaCube for cube analysis over a large amount of data.
May 27, 2015 - Capacity Development is key to Growth of Open Data Initiatives. ..... web applications like Eduweb8 and F
[email protected]. Abstract: In this paper, we bring forward a comparative study between the revolutionary enterprise big data analytical tools and.
Jul 4, 2016 - Apache Spark with extension for spatial functions (e.g.Spark Lidar). PostGIS, LasTools, rLiDAR, Geo-. Plus, Grass GIS- LiDAR Tools text based.
fault tolerance and disaster recovery. These concepts .... It pushes billions of messages per second in a reliable way t
platforms for big data analytics and compare them based on computing environment ... Keywords: Big data, enterprise, open source, analytical tools,. Hadoop ...
Abstract: In this paper, we bring forward a comparative study between the revolutionary enterprise big data analytical tools and the open source tools for the ...
Convergence or the 'blending' of data management platforms is being driven by ... delivered on a unified platform, provi
Aug 27, 2012 ... system, called ASTERIX, as our approach to addressing today's ... of the
ASTERIX stack is a data-intensive runtime called Hyracks [4]. ∗now at ...
open source platforms for Big Data Analytics â Mahout, MOA, R Project, Vowpal Wabbit, Pegasus, GraphLab .... 1) Big Data Basics that usually represent the.
By 2019 it is anticipated that the value created in the Hadoop, Event/Stream processing, New SQL and NoSQL market will b
Dec 1, 2015 - Scalable analysis of Big pathology image data cohorts using ... The analysis component based on high performance computing can carry out ...
Building Scalable Big Data Infrastructure Using Open Source ...
Apache Kafka. • Distributed pub-sub system. • Developed @ LinkedIn. • Offers
message persistence. • Very high throughput. – ~300K messages/sec.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon.
What is StumbleUpon? Help users find content they did not expect to find
The best way to discover new and interesting things from across the Web.
How StumbleUpon works 1. Register
2. Tell us your interests
3. Start Stumbling and rating web pages
We use your interests and behavior to recommend new content for you!
StumbleUpon
The Data Challenge 1
Data collection
2
Real time metrics
3
Batch processing / ETL
4
Data warehousing & ad-hoc analysis
5
Business intelligence & Reporting
Challenges in data collection • Different services deployed of different tech stacks Site
Rec /Ad Server
Apache / PHP
Scala / Finagle
Other internal services
Java / Scala / PHP
• Add minimal latency to the production services • Application DBs for Analytics / batch processing – From HBase & MySQL
Data Processing and Warehousing Raw Data
Challenges/Requirements: E T L
•Scale over 100 TBs of data •End product works with easy querying tools/languages
Warehouse (HDFS)
Tables
•Reliable and Scalable ᾶ powers analytics and internal reporting. Massively Denormalize d Tables
Real-time analytics and metrics • • • • •
Atomic counters Tracking product launches Monitoring the health of the site Latency – live metrics makes sense A/B tests
Open Source at SU
Data Collection at SU
Activity Streams and Logs
All messages are Protocol Buffers Fast and Efficient Multiple Language Bindings ( Java/ C++ / PHP ) Compact Very well documented Extensible
Apache Kafka • • • •
Distributed pub-sub system Developed @ LinkedIn Offers message persistence Very high throughput – ~300K messages/sec
• Horizontally scalable • Multiple subscribers for topics . – Easy to rewind
Kafka • Near real time process can be taken offline and done at the consumer level • Semantic partitioning through topics • Partitions for parallel consumption • High-level consumer API using ZK • Simple to deploy- only requires Zookeeper
Kafka At SU • • • • •
4 Broker nodes with RAID10 disks 25 topics Peak of 3500 msg/s 350 bytes avg. message size 30 days of data retention
Sutro • • • • • •
Scala/Finagle Generic Kafka message producer Deployed on all prod servers Local http daemon Publishes to Kafka asynchronously Snowflake to generate unique Ids
Sutro - Kafka Site Apache/PHP
Ad Server Scala/Finagle
Rec Server Scala/Finagle
Other Services
Sutro
Sutro
Sutro
Sutro
Kafka Broker
Broker Broker
Application Data for Analytics & Batch Processing HBase • HBase inter-cluster replication (from production to batch cluster) • Near real-time sync on batch cluster • Readily available in Hive for analysis
MySQL • MySQL replication to Batch DB Servers • Sqoop incremental data transfer to HDFS • HDFS flat files mapped to Hive tables & made available for analysis
A distributed time-series DB on HBase Collects over 2 Billion data points a day Plotting time series graphs Tagged data points
Real-time counters
Real time metrics from OpenTSDB
Kafka Consumer framework aka Postie • • • • • • •
Distributed system for consuming messages Scala/Akka -on top of Kafka’s consumer API Generic consumer - understands protobuf Predefined sinks HBase / HDFS (Text/Binary) / Redis Consumers configured with configuration files Distributed / uses ZK to co-ordinate Extensible
Postie
Akka • Building concurrent applications made easy !! • The distributed nodes are behind Remote Actors • Load balancing through custom Routers • The predefined sink and services are accessed through local actors • Fault-tolerance through actor monitoring
Postie
Batch processing / ETL GOAL: Create simplified data-sets from complex data Batch processing / ETL
Create highly denormalized data sets for faster querying
Output structured data for specific analysis e.g. Registration Flow analysis
Power the reporting DB with daily stats
Our favourite ETL tools: • Pig – – – – –
Optional Schema Work on byte arrays Many simple operations can be done without UDFs Developing UDFs is simple (understand Bags/Tuples) Concise scripts compared to the M/R equivalents
• Scalding – – – –
Functional programming in Scala over Hadoop Built on top of Cascading Operating over tuples is like operating over collections in Scala No UDFs .. Your entire program is in a full-fledged general purpose language
Warehouse - Hive Uses SQL-like querying language All Analysts and Data Scientists versed in SQL Supports Hadoop Streaming (Python/R) UDFs and Serdes make it highly extensible Supports partitioning with many table properties configurable at the partition level
Hive at StumbleUpon
HBaseSerde
• Reads binary data from HBase • Parses composite binary values into multiple columns in Hive (mainly on key)
ProtobufSerde
• For creating Hive tables on top of binary protobuf files stored in HDFS • Serde uses Java reflection to parse and project columns
Data Infrastructure at SU
Data Consumption
End Users of Data Data Pipeline
Who uses this data? •Data Scientists/Analysts •Offline Rec pipeline •Ads Team
Warehouse
Ὴ all this work allows them to focus on querying and analysis which is critical to the business.
Business Analytics / Data Scientists • Feature-rich set of data to work on • Enriched/Denormalized tables reduce JOINs, simplifies and speeds queries – shortening path to analysis. • R: our favorite tool for analysis post Hadoop/Hive.
Recommendation platform • URL score pipeline – M/R and Hive on Oozie – Filter / Classify into buckets – Score / Loop – Load ES/HBase index
• Keyword generation pipeline – Parse URL data – Generate Tag mappings
• Audience Estimation tool – Pre-crunched data into multiple dimensions – A UI tool for Advertisers to estimate target audience
• Sales team tools – Built with PHP leveraging Hive or pre-crunched ETL data in HBase
More stuff on the pipeline • Storm from Twitter – Scope for lot more real time analytics. – Very high throughput and extensible – Applications in JVM language
• BI tools – Our current BI tools / dashboards are minimal – Google charts powered by our reporting DB (HBase primarily).
Open Source FTW!! • • • •
Actively developed and maintained Community support Built with web-scale in mind Distributed systems – Easy with Akka/ZK/Finagle • Inexpensive • Only one major catch !! – Hire and retain good engineers !!