Data Acquisition Webinar Slides.pdf

Data Acquisition

Axel Ngonga
Lead Data Acquisition
BIG Data PPF
http://big-project.eu

Motivation
● Increasing amount of data
  ○ 4K new pictures on Instagram
  ○ 100K tweets
  ○ 800K new pieces of content on Facebook
  ○ …

Motivation

Motivation
● Big data technologies for
  ○ Improved business intelligence
  ○ Secure decisions
  ○ Customized services
  ○ …

● Use Cases
  ○ Mission planning
  ○ Trade market
  ○ Customized services
  ○ Criminality prediction
  ○ ...

Definition
● Data acquisition stands for
  ○ Selecting data sources
  ○ Collecting information from these sources
  ○ Filtering and cleaning data
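
The three steps of the definition can be sketched as a small pipeline. This is only an illustration under stated assumptions: the source names, the sample records, and the cleaning rules (whitespace normalization, dropping empty and duplicate records) are hypothetical and do not come from the slides.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical in-memory sources; real ones would be feeds, logs, or APIs.
SOURCES: Dict[str, Callable[[], Iterable[str]]] = {
    "tweets": lambda: ["  Hello BIG  ", "", "hello big", "Data rocks"],
    "logs": lambda: ["GET /index 200", "   ", "GET /index 200"],
    "ignored": lambda: ["never read"],
}


def select_sources(names: List[str]) -> Dict[str, Callable[[], Iterable[str]]]:
    """Step 1: keep only the sources that matter for the task."""
    return {name: SOURCES[name] for name in names if name in SOURCES}


def collect(sources: Dict[str, Callable[[], Iterable[str]]]) -> Iterator[dict]:
    """Step 2: pull raw records from every selected source."""
    for name, read in sources.items():
        for raw in read():
            yield {"source": name, "raw": raw}


def filter_and_clean(records: Iterable[dict]) -> Iterator[dict]:
    """Step 3: normalize whitespace, drop empty and duplicate records."""
    seen = set()
    for record in records:
        text = " ".join(record["raw"].split())
        key = (record["source"], text.lower())  # duplicates are case-insensitive
        if not text or key in seen:
            continue
        seen.add(key)
        yield {"source": record["source"], "text": text}


cleaned = list(filter_and_clean(collect(select_sources(["tweets", "logs"]))))
for row in cleaned:
    print(row)
```

Using generators keeps memory consumption low, because records flow through the steps one at a time instead of being materialized between them.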

Overview
[Diagram: DS, DS, DS, DS → Processing (cleaning, classification) → Storage]

More than 3 Vs
● The 9(?) Vs of Big Data Acquisition
  ○ Volume
  ○ Velocity
  ○ Variety
  ○ Vocabulary
  ○ Variability (security models, ownership)
  ○ Veracity (trustworthiness of data)
  ○ Visibility (integrated view of data)
  ○ Value (worth of data for data consumer)
  ○ Visualization

Requirements
● Extensibility of protocols
● High scalability of approaches
● Low memory consumption
● Parallelism
● Elasticity
● Fast ROI
● High throughput (real-time)

Technology Overview
● Gathering
  ○ Advanced Message Queuing Protocol
    ■ Wire-level protocol
    ■ OASIS Standard since Oct. 2012
    ■ Large number of implementations incl. RabbitMQ, SwiftMQ, Apache ActiveMQ, Windows Azure Service Bus
  ○ JMS 2.0
  ○ Kestrel (Memcached)
  ○ Apache Kafka
  ○ Apache Flume (log data)
  ○ FB Scribe (log data)
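
The gathering tools above share a message-queuing pattern: producers publish messages to a queue, and a consumer drains it independently. The sketch below imitates that pattern with Python's standard library only. It is not AMQP and does not use any of the listed brokers; the in-process `queue.Queue`, the source names, and the message bodies are stand-ins chosen for illustration.

```python
import queue
import threading

# Stand-in for a broker-side queue (assumption: a real setup would use a
# broker such as the ones listed on the slide, reached over the network).
broker = queue.Queue(maxsize=100)
STOP = object()  # sentinel telling the consumer that no more messages follow
gathered = []


def produce(source, bodies):
    """Publish one message per body, tagged with the originating source."""
    for body in bodies:
        broker.put({"source": source, "body": body})


def consume():
    """Drain the queue until the sentinel arrives."""
    while True:
        message = broker.get()
        if message is STOP:
            break
        gathered.append(message)


consumer = threading.Thread(target=consume)
consumer.start()

producers = [
    threading.Thread(target=produce, args=("app-log", ["started", "ready"])),
    threading.Thread(target=produce, args=("sensor", ["21.5", "21.7", "21.6"])),
]
for p in producers:
    p.start()
for p in producers:
    p.join()

broker.put(STOP)  # all producers are done, so the consumer may stop
consumer.join()
print(len(gathered), "messages gathered")
```

Messages from different producers may interleave in any order, but the messages of a single producer arrive in the order they were published. The bounded queue applies back-pressure: producers block when the consumer falls behind.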

Technology Overview
● Processing
  ○ Facebook Scribe (Aggregation)
  ○ Twitter Storm (Stream Data Processing, Analysis)
  ○ MOA (Massive Online Analysis, esp. classification)
  ○ Hadoop (Distributed Processing)
  ○ InfoSphere Streams (Analysis)

Technology Overview
● Storage
  ○ MongoDB (BSON)
  ○ Apache CouchDB (JSON)
  ○ Neo4J (Graph DB)
  ○ Oracle NoSQL
  ○ IBM DB2 NoSQL

● Holistic Frameworks
  ○ Oracle's Big Data Suite
  ○ IBM's Big Data Suite
  ○ Karmasphere
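
For the document stores listed under Storage, an acquired record is typically handed over as a JSON-like document. The following sketch only shows the serialization round trip with Python's standard `json` module; it does not talk to any database, and the field names and values are made up for illustration.

```python
import json

# Hypothetical cleaned record, as it might leave the processing step.
record = {
    "source": "tweets",
    "text": "Hello BIG",
    "tags": ["big-data", "acquisition"],
    "retweets": 3,
}

# Serialize to a JSON document; sorted keys make the output reproducible.
document = json.dumps(record, sort_keys=True)
print(document)

# Reading the document back yields an equal Python structure.
restored = json.loads(document)
```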

Tool Matrix

Simple Recipe
1. Which of the 9 Vs are important for me?
2. What are my sources?
  ○ Protocols
  ○ Velocity
  ○ Type of data (logs, XML, …)
  ○ ...
3. What’s my current storage architecture?
  ○ NoSQL?
  ○ Distributed?
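
One way to work through the recipe is to record the answers in a small structure and check which questions are still open. The sketch below is an assumption-laden illustration: the class names `Source` and `AcquisitionPlan`, their fields, and the sample answers are hypothetical, and the code makes no tool recommendation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NINE_VS = (
    "Volume", "Velocity", "Variety", "Vocabulary", "Variability",
    "Veracity", "Visibility", "Value", "Visualization",
)


@dataclass
class Source:
    """Answers to question 2 for a single source (hypothetical fields)."""
    name: str
    protocol: str
    velocity: str   # illustrative labels such as "batch" or "stream"
    data_type: str  # e.g. "logs" or "XML"


@dataclass
class AcquisitionPlan:
    important_vs: List[str] = field(default_factory=list)
    sources: List[Source] = field(default_factory=list)
    nosql: Optional[bool] = None        # None means "not answered yet"
    distributed: Optional[bool] = None

    def open_questions(self) -> List[str]:
        """Return the recipe questions that have no answer yet."""
        unknown = [v for v in self.important_vs if v not in NINE_VS]
        if unknown:
            raise ValueError(f"Not one of the 9 Vs: {unknown}")
        questions = []
        if not self.important_vs:
            questions.append("1. Which of the 9 Vs are important for me?")
        if not self.sources:
            questions.append("2. What are my sources?")
        if self.nosql is None or self.distributed is None:
            questions.append("3. What's my current storage architecture?")
        return questions


plan = AcquisitionPlan()
remaining_before = plan.open_questions()  # nothing answered yet

plan.important_vs = ["Velocity", "Veracity"]
plan.sources.append(Source("web-server", "HTTP", "stream", "logs"))
plan.nosql, plan.distributed = True, False
remaining_after = plan.open_questions()   # every question answered
```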

Thank You! Questions?
Axel Ngonga
University of Leipzig, AKSW Research Group
[email protected]
http://aksw.org/AxelNgonga
http://big-project.eu

Questionnaire