Apache Spark

Apache Spark: Cluster Computing Platform

ITV-DS, Applied Computing Group. Sergio Viademonte, PhD.

Spark Introduction
•  Apache Spark is an open-source, fast, general-purpose cluster computing platform
   •  parallel distributed processing
   •  fault tolerance
   •  on commodity hardware
•  Originally developed at UC Berkeley AMPLab, 2009
•  Open sourced in March 2010
•  Donated to the Apache Software Foundation, 2013
•  Written in Scala
•  Runs on the JVM


Insight Data Labs, December 2015

Spark Introduction
•  Deployed at massive scale, over multiple petabytes of data
•  Clusters of over 8,000 nodes
•  Used at Yahoo, Baidu, Tencent, ...



Spark Features
•  Spark is a general computation engine that uses distributed memory to perform fault-tolerant computations on a cluster
•  Speed
•  Ease of use
•  Analytics
•  Suited to environments that require
   •  large datasets
   •  low-latency processing
•  Spark can perform iterative computations at scale (in memory), which opens up the possibility of executing machine learning algorithms much faster than with Hadoop MapReduce (disk-based) [2] [4].

Spark Features
•  Computational engine, responsible for:
   •  scheduling
   •  distributing
   •  monitoring applications consisting of many computational tasks across a cluster
•  From an engineering perspective, Spark hides the complexity of:
   •  distributed systems programming
   •  network communication
   •  fault tolerance

Spark Features
•  Spark contains multiple closely integrated components, designed to interoperate closely and to be combined as libraries in a software project.
•  Supports Java (6+), Scala (2.10+) and Python (2.6+)
•  Runs on top of Hadoop, Mesos* [1], standalone, or in the cloud
•  Accesses diverse data sources: HDFS, Cassandra, HBase [2]
•  Supports SQL queries
•  Machine learning algorithms
•  Graph processing
•  Stream processing
•  Sensor data processing

* Mesos is a general cluster manager that provides APIs for resource management and scheduling across datacenter and cloud environments (www.mesos.apache.org). It can also run Hadoop MapReduce and service applications.
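As a minimal sketch of how a Spark application is set up in Scala (the app name and the "local[*]" master URL here are illustrative; on a real cluster you would point at your cluster manager instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver setup; "local[*]" runs Spark in-process using all cores.
// On a cluster, use e.g. "spark://host:7077" (Standalone) or "mesos://host:5050".
val conf = new SparkConf()
  .setAppName("DemoApp")
  .setMaster("local[*]")
val sc = new SparkContext(conf)

// Distribute a local collection across the cluster and reduce it back.
val data = sc.parallelize(1 to 100)
println(data.sum())  // 5050.0

sc.stop()
```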


Spark Ecosystem


https://www.safaribooksonline.com/library/view/learning-spark/

Spark Ecosystem
Spark Core
The execution engine for the Spark platform that all other functionality is built on top of. Contains the basic functionality of Spark [5]:
•  in-memory computing capabilities
•  memory management
•  components for task scheduling
•  fault recovery
•  interacting with storage systems
•  Java, Scala, and Python APIs
•  Resilient Distributed Datasets (RDDs) API


Spark Ecosystem
§  Resilient Distributed Datasets (RDDs)
•  Spark's main programming abstraction for working with data.
•  An RDD represents a fault-tolerant collection of elements distributed across many compute nodes that can be manipulated in parallel.
•  Spark Core provides many APIs for building and manipulating these collections.
•  All work is expressed as:
   •  creating new RDDs
   •  transforming existing RDDs – return pointers to new RDDs
   •  actions – call operations on RDDs and return values

Ex:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

Spark Ecosystem
§  Resilient Distributed Datasets (RDDs)
•  creating new RDDs
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

•  transforming existing RDDs – return pointers to RDDs
Ex: the filter transformation returns a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

•  actions – call operations on RDDs and return values
scala> textFile.count() // Number of items in this RDD

res0: Long = 126

scala> textFile.first() // First item in this RDD

res1: String = # Apache Spark
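The three kinds of work above compose naturally. A classic word count, sketched below, chains several lazy transformations with a single action; it assumes an existing SparkContext `sc` and a README.md in the working directory:

```scala
// Transformations (lazy): split lines into words, pair each word with 1,
// then sum the counts per word.
val counts = sc.textFile("README.md")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Nothing has executed yet; collect() is the action that triggers the job.
counts.collect().foreach { case (word, n) => println(s"$word: $n") }
```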


Spark Ecosystem
Spark SQL
•  Spark SQL is a Spark module for structured data processing
•  Allows querying data via SQL as well as HQL (Hive Query Language)
•  Acts as a distributed SQL query engine
•  Extends the Spark RDD API
•  Provides DataFrames – a DataFrame is equivalent to a relational table in Spark SQL
•  Provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning)
•  Enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data


Spark Ecosystem
Spark SQL - DataFrames

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Display the content of the DataFrame on stdout
df.show()
// age name
// 20  Michael
// 30  Andy
// 19  Justin




Spark Ecosystem
Spark SQL - DataFrames

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

// Select people older than 21
df.filter(df("age") > 21).show()

// Count people by age
df.groupBy("age").count().show()
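The same DataFrame can also be queried with plain SQL, which is the "distributed SQL query engine" role mentioned earlier. A sketch using the Spark 1.x SQLContext API (the table name "people" is illustrative):

```scala
// Register the DataFrame as a temporary table so SQL can refer to it by name.
df.registerTempTable("people")

// Run an ordinary SQL query; the result is another DataFrame.
val adults = sqlContext.sql("SELECT name FROM people WHERE age > 21")
adults.show()
```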

Spark Ecosystem
Spark Streaming
•  Spark component that provides the ability to process and analyze live streams of data in real time
•  Web logs, online posts and updates from web services, log files, etc.
•  Enables powerful interactive and analytical applications across both streaming and historical data
•  Integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter
•  Provides an API for manipulating data streams
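A minimal sketch of the streaming API: a word count over a TCP text source, processed in micro-batches. The host, port, and batch interval are placeholders, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the stream in 10-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(10))

// Receive lines of text over TCP; host/port are illustrative.
val lines = ssc.socketTextStream("localhost", 9999)

// Same word-count logic as on a static RDD, applied per batch.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until the stream is stopped
```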


Spark Ecosystem
Spark MLlib – Machine Learning
•  MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and speed (up to 100x faster than MapReduce).
•  Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
•  The library is usable in Java, Scala, and Python as part of Spark applications, so it can be included in complete workflows.
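As a sketch of the clustering support, the snippet below runs MLlib's k-means on a few made-up 2-D points; the data, k, and iteration count are illustrative, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters: points near (0, 0) and points near (9, 9).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
))

// Train k-means with k = 2; training itself is an iterative in-memory job.
val model = KMeans.train(points, k = 2, maxIterations = 20)
model.clusterCenters.foreach(println)
```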


Spark Ecosystem
Spark GraphX – Graph Computation
•  A library for manipulating graphs
•  Performs graph-parallel computations
•  GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale
•  GraphX extends the Spark RDD API
•  Provides a library of graph algorithms (e.g., PageRank and triangle counting)
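A minimal sketch of the GraphX API: build a tiny directed graph from vertex and edge RDDs and run the built-in PageRank. The vertex IDs, labels, and convergence tolerance are illustrative, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A three-node cycle: a -> b -> c -> a.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// Run PageRank until ranks change by less than the given tolerance.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```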


Spark Ecosystem
Cluster Managers
•  Spark is designed to scale up from one to many thousands of compute nodes
•  Runs on diverse cluster managers:
   •  Hadoop YARN
   •  Apache Mesos [1]
   •  Standalone Scheduler – a simple cluster manager included in Spark
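In code, the choice of cluster manager comes down to the master URL passed to SparkConf; the host names below are placeholders:

```scala
import org.apache.spark.SparkConf

// The master URL selects the cluster manager the driver connects to.
val standalone = new SparkConf().setMaster("spark://master:7077")  // Standalone Scheduler
val mesos      = new SparkConf().setMaster("mesos://master:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-client")          // Hadoop YARN (Spark 1.x form)
```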


References
[1] Apache Mesos. http://spark.apache.org/docs/1.3.0/cluster-overview.html
[2] Apache Hadoop. http://hadoop.apache.org/
[3] Apache Mahout. https://mahout.apache.org/
[4] Shi, Juwei; Qiu, Yunjie; et al. "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics." IBM Research China, IBM Almaden Research Center, Renmin University of China.
[5] Safari Books Online. https://www.safaribooksonline.com/library/view/learning-spark/
