Apache Spark Cluster Computing Platform
ITV-DS, Applied Computing Group. Sergio Viademonte, PhD.
Spark Introduction
- Apache Spark is an open source, fast, general-purpose cluster computing platform
- Parallel distributed processing
- Fault tolerance
- On commodity hardware
- Originally developed at UC Berkeley AMP Lab, 2009
- Open sourced in March 2010
- Apache Software Foundation, 2013
- Written in Scala
- Runs on the JVM
Insight Data Labs, December 2015
Spark Introduction
- Deployed at massive scale, multiple petabytes of data
- Clusters of over 8,000 nodes
- Yahoo, Baidu, Tencent, ...
Spark Features
• Spark is a general computation engine that uses distributed memory to perform fault-tolerant computations with a cluster
• Speed
• Ease of use
• Analytics
• Environments that require
  • Large datasets
  • Low latency processing
• Spark can perform iterative computations at scale (in memory), which opens up the possibility of executing machine learning algorithms much faster than with Hadoop MR (disk-based) [2] [4].
Spark Features
• Computational engine for scheduling, distributing, and monitoring applications consisting of many computational tasks across a computational cluster.
• From an engineering perspective, Spark hides the complexity of:
  • distributed systems programming
  • network communication
  • fault tolerance
Spark Features
• Spark contains multiple closely integrated components, designed to interoperate and to be combined as libraries in a software project.
• Supports Java (6+), Scala (2.10+) and Python (2.6+)
• Runs on top of Hadoop, Mesos* [1], Standalone or in the cloud
• Accesses diverse data sources: HDFS, Cassandra, HBase [2]
• Supports SQL queries
• Machine learning algorithms
• Graph processing
• Stream processing
• Sensor data processing
* Mesos: a general cluster manager that provides APIs for resource management and scheduling across datacenter and cloud environments. It can run Hadoop MR and service applications.
Spark Ecosystem
Spark Ecosystem
Spark Core
The execution engine for the Spark platform, on top of which all other functionality is built. Contains the basic functionality of Spark [5]:
• in-memory computing capabilities
• memory management
• components for task scheduling
• fault recovery
• interacting with storage systems
• Java, Scala, and Python APIs
• Resilient Distributed Datasets (RDDs) API
Spark Ecosystem
§ Resilient Distributed Datasets (RDDs)
• Spark's main programming abstraction for working with data.
• RDDs represent a fault-tolerant collection of elements distributed across many compute nodes that can be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
• All work is expressed as
  • creating new RDDs
  • transforming existing RDDs - return pointers to RDDs
  • actions, calling operations on RDDs - return values
Ex:
val textFile = sc.textFile("")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
Spark Ecosystem
§ Resilient Distributed Datasets (RDDs)
• creating new RDDs
scala> val textFile = sc.textFile("")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
• transforming existing RDDs - return pointers to RDDs
Ex: filter transformation to return a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
• actions, calling operations on RDDs - return values
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
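The create / transform / act cycle above can also be sketched as a small standalone program. This is a minimal sketch, not the deck's own code: the object name `RddBasics`, the helper `countLines`, and the in-memory `sampleLines` are hypothetical stand-ins for the text file on the slide, and it assumes Spark is on the classpath and run in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  // Hypothetical sample lines standing in for the text file used on the slide.
  val sampleLines = Seq(
    "# Apache Spark",
    "Spark is a fast and general cluster computing system",
    "It provides high-level APIs in Java, Scala, and Python"
  )

  // Returns (total number of lines, number of lines containing "Spark").
  def countLines(sc: SparkContext): (Long, Long) = {
    // Creating a new RDD from an in-memory collection.
    val textFile = sc.parallelize(sampleLines)
    // Transformation: lazy, it only records how to derive the new RDD.
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    // Actions: trigger the computation and return values to the driver.
    (textFile.count(), linesWithSpark.count())
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM, using all available cores.
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (total, withSpark) = countLines(sc)
      println(s"lines: $total, lines containing Spark: $withSpark")
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is computed when `filter` is called; the work happens only when an action such as `count()` runs.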
Spark Ecosystem
Spark SQL
• Spark SQL is a Spark module for structured data processing
• Allows querying data via SQL as well as HQL (Hive Query Language)
• Acts as a distributed SQL query engine
• Extends the Spark RDD API
• It provides DataFrames - a DataFrame is equivalent to a relational table in Spark SQL.
• It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning)
• It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data
Spark Ecosystem
Spark SQL - DataFrames
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// age name
// 20 Michael
// 30 Andy
// 19 Justin
Spark Ecosystem
Spark SQL - DataFrames
// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
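The same kind of selection can be expressed as a SQL query instead of DataFrame method calls. The following is a sketch under stated assumptions: it targets the Spark 1.x `SQLContext` API used in these slides, the object name `DataFrameSql` and helper `namesOlderThan` are hypothetical, and the rows are an in-memory copy of the values shown above rather than the JSON file.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSql {
  // Returns the names of people older than minAge, selected through SQL.
  def namesOlderThan(sqlContext: SQLContext, minAge: Int): Seq[String] = {
    // Hypothetical in-memory rows mirroring the values shown on the slide.
    val people = Seq(("Michael", 20), ("Andy", 30), ("Justin", 19))
    val df = sqlContext.createDataFrame(people).toDF("name", "age")
    // Spark 1.x call; on Spark 2.0+ the replacement is
    // df.createOrReplaceTempView("people").
    df.registerTempTable("people")
    sqlContext
      .sql(s"SELECT name FROM people WHERE age > $minAge")
      .collect()                      // Array[Row] on the driver
      .map(row => row.getString(0))   // first column is "name"
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DataFrameSql").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      namesOlderThan(sqlContext, 21).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Registering the DataFrame under a table name is what makes it visible to `sqlContext.sql`.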
Spark Ecosystem
Spark Streaming
• Spark component that provides the ability to process and analyze live streams of data in real time.
• Web logs, online posts and updates from web services, logfiles, etc.
• Enables powerful interactive and analytical applications across both streaming and historical data
• Integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter
• API for manipulating data streams
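To give a feel for the stream-manipulation API, here is a minimal word-count sketch over a text socket. It is an illustration only: the object name, the host `localhost`, and the port `9999` are assumptions, and it assumes Spark 1.3 or later, where pair operations such as `reduceByKey` need no extra import.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one receives data, the rest process it.
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Incoming data is grouped into 1-second batches.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: lines of text arriving on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same style of transformations as on RDDs, applied to each batch.
    val counts = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Print the first few counts of every batch to stdout.
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```

One way to feed it test input is a local utility such as `nc -lk 9999`, typing lines into that terminal while the job runs.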
Spark Ecosystem
Spark MLlib - Machine Learning
• MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and speed (up to 100x faster than MapReduce).
• Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
• The library is usable in Java, Scala, and Python as part of Spark applications, so it can be included in complete workflows
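As one illustration of the clustering support, the sketch below trains k-means on a handful of points with the RDD-based `org.apache.spark.mllib` API. The data, the object name `KMeansSketch`, and the helper `separatesGroups` are hypothetical; the choice of k = 2 and 20 iterations is arbitrary.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Returns true when two well-separated groups of hypothetical points
  // are assigned to different clusters.
  def separatesGroups(sc: SparkContext): Boolean = {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(8.9, 9.0)
    )).cache() // k-means is iterative, so keep the data in memory

    // Train with k = 2 clusters and at most 20 iterations.
    val model = KMeans.train(points, 2, 20)

    // Predict the cluster index for one new point near each group.
    val low = model.predict(Vectors.dense(0.0, 0.1))
    val high = model.predict(Vectors.dense(9.0, 9.1))
    low != high
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KMeansSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"groups separated: ${separatesGroups(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

Caching the input matters here because every k-means iteration rereads the same points, which is the in-memory iterative pattern the deck contrasts with disk-based MapReduce.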
Spark Ecosystem
Spark GraphX - Graph Computation
• It is a library for manipulating graphs
• Performs graph-parallel computations
• GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale
• GraphX extends the Spark RDD API
• Provides a library of graph algorithms (e.g., PageRank and triangle counting)
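A short sketch of the PageRank algorithm mentioned above, run on a tiny hand-built graph. The three vertices, the two edges, the tolerance value, and the names `PageRankSketch` and `topVertex` are all assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PageRankSketch {
  // Returns the id of the highest-ranked vertex in a tiny hypothetical graph.
  def topVertex(sc: SparkContext): Long = {
    // Vertices are (id, attribute) pairs; edges carry an attribute too.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    // Edges 1 -> 3 and 2 -> 3: vertex 3 receives every link.
    val edges = sc.parallelize(Seq(Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // Iterate until the ranks change by less than the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices // pairs of (vertex id, rank)

    ranks.collect().maxBy(_._2)._1
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageRankSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(s"top vertex: ${topVertex(sc)}")
    } finally {
      sc.stop()
    }
  }
}
```

The vertex and edge collections are ordinary RDDs, which is how GraphX extends the RDD API.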
Spark Ecosystem
Cluster Managers
• Spark is designed to scale up from one to many thousands of compute nodes
• Runs on diverse cluster managers:
  • Hadoop YARN
  • Apache Mesos [1]
  • Standalone Scheduler - a simple cluster manager included in Spark
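From application code, the choice of cluster manager mostly comes down to the master URL passed to `SparkConf`. The sketch below maps each manager to a URL; the host names are hypothetical placeholders, the ports are the usual defaults, and the object and helper names are invented for illustration.

```scala
import org.apache.spark.SparkConf

object MasterUrls {
  // Builds a SparkConf for the given cluster manager.
  // Host names below are hypothetical placeholders.
  def confFor(manager: String): SparkConf = {
    val master = manager match {
      case "local"      => "local[*]"                 // no cluster, all local cores
      case "standalone" => "spark://master-host:7077" // Spark's own scheduler
      case "mesos"      => "mesos://mesos-host:5050"  // Apache Mesos
      case "yarn"       => "yarn-client"              // Spark 1.x; "yarn" on 2.0+
      case other =>
        throw new IllegalArgumentException(s"unknown cluster manager: $other")
    }
    new SparkConf().setAppName("ClusterManagerDemo").setMaster(master)
  }

  def main(args: Array[String]): Unit = {
    // Print the master URL chosen for each manager.
    Seq("local", "standalone", "mesos", "yarn").foreach { m =>
      println(s"$m -> ${confFor(m).get("spark.master")}")
    }
  }
}
```

In practice the master is often left out of the code and supplied with `spark-submit --master` instead, so the same application can move between managers unchanged.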
References
[1] Apache Mesos. http://
[2] Apache Hadoop. http://
[3] Apache Mahout. https://
[4] Shi, Juwei; Qiu, Yunjie et al. "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics". IBM Research China, IBM Almaden Research Center, Renmin University of China.
[5] Safari Books Online. https:// learning-spark/.