Apache Spark: Cluster Computing Platform
ITV-DS, Applied Computing Group. Sergio Viademonte, PhD.
Insight Data Labs, December 2015
Spark Introduction
- Apache Spark is an open source, fast, general-purpose cluster computing platform
  - Parallel distributed processing
  - Fault tolerance
  - On commodity hardware
- Originally developed at UC Berkeley AMP Lab, 2009
- Open sourced in March 2010
- Apache Software Foundation, 2013
- Written in Scala
- Runs on the JVM
Spark Introduction
- Deployed at massive scale, on multiple petabytes of data
- Clusters of over 8,000 nodes
- Yahoo, Baidu, Tencent, ...
Spark Features
• Spark is a general computation engine that uses distributed memory to perform fault-tolerant computations across a cluster
• Speed
• Ease of use
• Analytics
• Suited to environments that require:
  • Large datasets
  • Low-latency processing
• Spark can perform iterative computations at scale, in memory, which opens up the possibility of executing machine learning algorithms much faster than with disk-based Hadoop MapReduce [2] [4]; a sketch of the pattern follows below.
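A minimal sketch of that iterative pattern, assuming an existing SparkContext sc; the input file "data.txt" and the toy update rule are hypothetical:

// Cache the dataset in memory once, then make repeated passes over it.
// Each iteration reads the cached RDD from memory rather than re-reading
// from disk, which is what a disk-based MapReduce job would have to do.
val points = sc.textFile("data.txt").map(_.toDouble).cache()

var w = 0.0
for (i <- 1 to 10) {
  val current = w                                    // capture for the closure
  val gradient = points.map(x => x - current).mean() // one full pass, in memory
  w += 0.1 * gradient
}
println(s"estimate after 10 iterations: $w")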
Spark Features
• Computational engine, responsible for:
  • Scheduling
  • Distributing
  • Monitoring applications consisting of many computational tasks across a compute cluster
• From an engineering perspective, Spark hides the complexity of:
  • Distributed systems programming
  • Network communication
  • Fault tolerance
Spark Features
• Spark contains multiple closely integrated components, designed to interoperate closely and to be combined as libraries in a software project
• Supports Java (6+), Scala (2.10+) and Python (2.6+)
• Runs on top of Hadoop, Mesos* [1], standalone, or in the cloud
• Accesses diverse data sources: HDFS, Cassandra, HBase [2]
• Supports SQL queries
• Machine learning algorithms
• Graph processing
• Stream processing
• Sensor data processing

* A general cluster manager that provides APIs for resource management and scheduling across datacenter and cloud environments (www.mesos.apache.org). It can also run Hadoop MapReduce and service applications.
Spark Ecosystem
[Figure: Spark ecosystem components. Source: https://www.safaribooksonline.com/library/view/learning-spark/ [5]]
Spark Ecosystem
Spark Core
The execution engine for the Spark platform, which all other functionality is built on top of. Contains the basic functionality of Spark [5]:
• In-memory computing capabilities
• Memory management
• Components for task scheduling
• Fault recovery
• Interacting with storage systems
• Java, Scala, and Python APIs
• Resilient Distributed Datasets (RDDs) API
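As a minimal sketch of how an application obtains Spark Core's entry point, the SparkContext (the application name and the local master URL below are placeholder choices):

import org.apache.spark.{SparkConf, SparkContext}

// Configure and create the SparkContext, the handle through which RDDs
// are built and jobs are submitted to the cluster.
val conf = new SparkConf()
  .setAppName("SparkCoreExample") // hypothetical application name
  .setMaster("local[2]")          // run locally with two worker threads
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 100) // distribute a local collection as an RDD
println(rdd.sum())                 // action: 5050.0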
Spark Ecosystem
§ Resilient Distributed Datasets (RDDs)
• Spark's main programming abstraction for working with data
• RDDs represent a fault-tolerant collection of elements, distributed across many compute nodes, that can be manipulated in parallel
• Spark Core provides many APIs for building and manipulating these collections
• All work is expressed as:
  • Creating new RDDs
  • Transforming existing RDDs – returns pointers to new RDDs
  • Actions, calling operations on RDDs – return values
Ex:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
Spark Ecosystem
§ Resilient Distributed Datasets (RDDs)
• Creating new RDDs:
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
• Transforming existing RDDs – returns a pointer to a new RDD. Ex: the filter transformation returns a new RDD with a subset of the items in the file:
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
• Actions, calling operations on RDDs – return values:
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
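Putting transformations and actions together, a small word-count sketch against the same README.md (the pairs shown in the output are illustrative):

// flatMap, map, and reduceByKey are lazy transformations; nothing runs yet
scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// The action take() triggers the actual computation
scala> counts.take(3).foreach(println)
(Spark,15)
(the,21)
(and,10)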
Spark Ecosystem
Spark SQL
• Spark SQL is a Spark module for structured data processing
• Allows querying data via SQL as well as HQL (Hive Query Language)
• Acts as a distributed SQL query engine
• Extends the Spark RDD API
• Provides DataFrames – a DataFrame is equivalent to a relational table in Spark SQL
• Provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning)
• Enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data
Spark Ecosystem
Spark SQL – DataFrames

val sc: SparkContext // An existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Display the content of the DataFrame to stdout
df.show()
// age  name
// 20   Michael
// 30   Andy
// 19   Justin
Spark Ecosystem
Spark SQL – DataFrames

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

// Select people older than 21
df.filter(df("age") > 21).show()

// Count people by age
df.groupBy("age").count().show()
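Since Spark SQL acts as a distributed query engine, the same DataFrame can also be queried with plain SQL by registering it as a temporary table (Spark 1.x API; the table name "people" is an arbitrary choice):

// Register the DataFrame so it can be referenced from SQL
df.registerTempTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
// name
// Justin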
Spark Ecosystem
Spark Streaming
• Spark component that provides the ability to process and analyze live streams of data in real time
• Web logs, online posts and updates from web services, log files, etc.
• Enables powerful interactive and analytical applications across both streaming and historical data
• Integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter
• API for manipulating data streams
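A minimal streaming word-count sketch, assuming an existing SparkContext sc and a TCP text source (the host and port are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the live stream in 10-second micro-batches
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// The same RDD-style operations, applied to each batch of the stream
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing data
ssc.awaitTermination() // block until the streaming job is stopped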
Spark Ecosystem
Spark MLlib – Machine Learning
• MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and speed (up to 100x faster than MapReduce)
• Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import
• The library is usable in Java, Scala, and Python as part of Spark applications, so it can be included in complete workflows
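As a minimal sketch, clustering a few toy points with MLlib's KMeans (Spark 1.x RDD-based API; assumes an existing SparkContext sc, with inline toy data):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
)).cache() // KMeans is iterative, so keep the data in memory

// Train with k = 2 clusters and at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)

// Model evaluation: within-set sum of squared errors
println(s"WSSSE = ${model.computeCost(points)}")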
Spark Ecosystem
Spark GraphX – Graph Computation
• A library for manipulating graphs
• Performs graph-parallel computations
• GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale
• GraphX extends the Spark RDD API
• Provides a library of graph algorithms (e.g., PageRank and triangle counting)
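A minimal GraphX sketch: build a tiny graph from vertex and edge RDDs and run PageRank (assumes an existing SparkContext sc; the vertices and edges are toy data):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))
val graph = Graph(vertices, edges)

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach(println)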
Spark Ecosystem
Cluster Managers
• Spark is designed to scale up from one to many thousands of compute nodes
• Runs on diverse cluster managers:
  • Hadoop YARN
  • Apache Mesos [1]
  • Standalone Scheduler – a simple cluster manager included in Spark
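The cluster manager is selected through the application's master URL, as in this sketch (host names and ports are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MyApp")
conf.setMaster("local[4]")                // single machine, 4 threads (no cluster manager)
// conf.setMaster("spark://master:7077")  // Standalone Scheduler
// conf.setMaster("mesos://master:5050")  // Apache Mesos
// conf.setMaster("yarn-client")          // Hadoop YARN (Spark 1.x client mode)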
References
[1] Apache Mesos. http://spark.apache.org/docs/1.3.0/cluster-overview.html
[2] Apache Hadoop. http://hadoop.apache.org/
[3] Apache Mahout. https://mahout.apache.org/
[4] Shi, Juwei; Qiu, Yunjie et al. "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics." IBM Research China, IBM Almaden Research Center, Renmin University of China.
[5] Safari Books Online. https://www.safaribooksonline.com/library/view/learning-spark/