cis6930fa11_Spark - Spark: Cluster Computing with Working Sets
Spark: Cluster Computing with Working Sets

Outline
● Why?
● Mesos
● Resilient Distributed Dataset
● Spark & Scala
● Examples
● Uses

Why?
● MapReduce deficiencies:
○ Standard dataflows are acyclic
■ Prevents iterative jobs
■ Not suited to applications that reuse a working set
■ e.g. machine learning, graph applications
○ Interactive analytics
■ Ad hoc questions answered by basic SQL queries
■ Have to wait for reads from disk, or deserialization
○ Multiple queries
■ Processed, but ephemeral
■ Each query is individual, even if they all rely on similar base data

Why? Iterative Problems
● MapReduce
● What is meant by "reuse a working set"?
○ The same data is reused across iterations
○ analogous to a working set in virtual memory
● Example algorithms
○ k-means
■ data points to be classified
○ Logistic regression
■ data points to be classified
○ Expectation maximization
■ observed data
○ Alternating least squares
■ feature vectors for each side

How?
● Caching
● Avoid reading from files and deserializing Java objects
● The main Hadoop speedups came first from caching files in memory (an OS-level optimization), then from caching serialized data
○ Both of these still need to read the data in!
● Avoid even that read by keeping the data around as standard Java objects

Mesos
● Resource isolation and sharing across distributed applications
● Manages pools of compute instances
○ distribution of files, work, memory
○ network communications
● Allows heterogeneous and incompatible systems to coexist within a cluster
● Gives each job the resources it needs to ensure throughput
○ Don't starve anyone
○ But make sure to utilize all available resources
● Manages different types of systems in a cluster
○ Spark, Dryad, Hadoop, MPI
● Allows multiple groups to process multiple datasets, each using their own data

Resilient Distributed Dataset ...
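The working-set reuse described above is what Spark's `cache()` enables. The following Scala sketch is modeled on the logistic regression example in the Spark paper; the master URL, input path, and the `parsePoint` and `Vector` helpers are illustrative assumptions, not part of these slides:

```scala
// Sketch, in the spirit of the Spark paper's logistic regression
// example. The point data is the working set: read from HDFS and
// parsed into Java objects once, cached in memory, and then reused
// on every iteration. `parsePoint` and `Vector` are assumed
// helpers; the master URL and input path are placeholders.
val sc = new SparkContext("mesos://master:5050", "LogisticRegression")

val points = sc.textFile("hdfs://data/points.txt")
               .map(parsePoint)   // deserialize once...
               .cache()           // ...then keep as in-memory objects

var w = Vector.random(dims)       // initial weight vector
for (i <- 1 to iterations) {
  // Each pass reuses the cached points instead of re-reading and
  // re-deserializing them, unlike an acyclic MapReduce dataflow.
  val gradient = points.map { p =>
    val s = (1.0 / (1.0 + math.exp(-p.y * (w dot p.x))) - 1.0) * p.y
    p.x * s
  }.reduce(_ + _)
  w -= gradient
}
```

Without the `cache()` call, each iteration would go back to disk and deserialize again; with it, later iterations run against already-materialized objects, which is exactly the speedup the "How?" slide argues for.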

This note was uploaded on 11/09/2011 for the course CIS 6930 taught by Professor Staff during the Fall '08 term at University of Florida.
