Apache Spark
INF 551 & 553
Wensheng Wu
Roadmap
- Spark: history, features, RDD, and installation
- RDD operations
  - Creating initial RDDs
  - Actions
  - Transformations
- Examples
- Shuffling in Spark
- Persistence in Spark
History
- Apache took over the Hadoop project
Characteristics of Hadoop
- Acyclic data flow model
  - Data loaded from stable storage (e.g., HDFS)
  - Processed through a sequence of steps
  - Results written to disk
- Batch processing
  - No interaction permitted during processing
Problems
- Ill-suited for iterative algorithms that require repeated reuse of data
  - E.g., machine learning and data mining algorithms such as k-means, PageRank, and logistic regression
- Ill-suited for interactive exploration of data
  - E.g., OLAP on big data
Spark
- Supports working sets (of data) through RDDs
  - Enabling reuse & fault tolerance
- 10x faster than Hadoop in iterative jobs
- Interactively explores a 39 GB Wikipedia dump with sub-second response time
  - Data were distributed over 15 EC2 instances
Spark
- Provides libraries to support:
  - embedded use of SQL
  - stream data processing
  - machine learning algorithms
  - processing of graph data
Spark
- Supports diverse data sources, including HDFS, Cassandra, HBase, and Amazon S3
RDD: Resilient Distributed Dataset
- RDD: a read-only, partitioned collection of records
  - Operations are performed on partitions in parallel
  - Maintains lineage for efficient fault tolerance
- Methods of creating an RDD (see the sketch below):
  - from an existing collection (e.g., Python list/tuple)
  - from an external file
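A minimal PySpark sketch of both creation methods, assuming the interactive shell's pre-created sc; the input file path is hypothetical:

    # Inside bin/pyspark, a SparkContext object is already available as `sc`.

    # 1) RDD from an existing Python collection
    nums = sc.parallelize([1, 2, 3, 4, 5])   # records are split across partitions
    print(nums.collect())                    # [1, 2, 3, 4, 5]

    # 2) RDD from an external file (hypothetical path), one record per line
    lines = sc.textFile("data/sample.txt")
    print(lines.count())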
RDD: Resilient Distributed Dataset
- Distributed: data are divided into a number of partitions & distributed across the nodes of a cluster, to be processed in parallel
- Resilient: Spark keeps track of the transformations applied to a dataset (see the sketch below)
  - Enables efficient recovery on failure (no need to replicate large amounts of data across the network)
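A small sketch of inspecting partitioning and lineage from the shell; the partition count and the exact grouping of records below are only illustrative:

    # Ask for 4 partitions explicitly; they are spread across the cluster's nodes.
    rdd = sc.parallelize(range(10), 4)
    print(rdd.getNumPartitions())   # 4
    print(rdd.glom().collect())     # records grouped by partition, e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

    # Lineage: Spark records the chain of transformations, not a copy of the data.
    doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)
    print(doubled.toDebugString())  # textual description of the recorded transformations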
Architecture
- The SparkContext (SC) object coordinates the execution of an application across multiple nodes (see the sketch below)
  - Similar to the JobTracker in Hadoop MapReduce
- [Diagram: SC acquires resources and sends tasks to executors; executors send responses back]
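A minimal sketch of a standalone driver program creating its own SparkContext; the application name and master URL are assumptions:

    from pyspark import SparkConf, SparkContext

    # The driver's SparkContext acquires executors from the cluster manager
    # and ships tasks to them; executors send results back to the driver.
    conf = SparkConf().setAppName("my-app").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    result = sc.parallelize(range(100)).map(lambda x: x * x).sum()
    print(result)   # 328350

    sc.stop()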
Components
- Cluster manager
  - Allocates resources across applications
  - Can be Spark's own cluster manager or Apache YARN (Yet Another Resource Negotiator)
- Executors
  - Run tasks & store data
Spark installation
- Choose "pre-built for Hadoop 2.7 and later"
- Direct link (choose version 2.2.1): ark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
Spark installation
- tar xvf spark-2.2.1-bin-hadoop2.7.tgz
- This produces the "spark-2.2.1-bin-hadoop2.7" folder
  - Containing all Spark materials (scripts, programs, libraries, examples, data)
Prerequisites
- Make sure Java is installed & JAVA_HOME is set
Accessing Spark from Python
- Interactive shell: bin/pyspark
  - A SparkContext object sc is automatically created (see the sketch below)
- bin/pyspark --master local[4]
  - This starts Spark on the local host with 4 worker threads
  - "--master" specifies the location of the Spark master node
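A quick sanity check inside the bin/pyspark shell, using the pre-created sc; the master URL shown is what a local[4] session would report:

    # No construction needed: the shell has already created `sc`.
    print(sc.master)                           # 'local[4]'
    print(sc.parallelize(range(1000)).sum())   # 499500, computed across the 4 local threads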