• Data Locality and Caching
– Same as for Hadoop MapReduce: avoid network I/O; workers should manage local data
– First run: data is not in cache, so use the HadoopRDD's locality preferences (from HDFS)
– Second run: the FilteredRDD is in cache, so use its locations
– If data falls out of cache, go back to HDFS (see the sketch below)
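A minimal PySpark sketch of the two runs described above (the HDFS path is left elided as in the slides, and the filter predicate is illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="CachingDemo")

    lines = sc.textFile("hdfs://...")   # HadoopRDD: locality preferences come from HDFS block locations
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    errors.cache()

    errors.count()      # first run: nothing cached yet, so tasks are placed by HDFS locality
    errors.count()      # second run: the filtered RDD is cached, so tasks go to its cached partitions
    errors.unpersist()  # evict the cache; the next action falls back to HDFS
    errors.count()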
3 Spark Core
• Programming in Spark: Log mining example
– Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")                    # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
    messages = errors.map(lambda s: s.split("\t"))
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()         # Action
    messages.filter(lambda s: "php" in s).count()
    ...

[Figure: the driver ships tasks to three workers; each worker reads one HDFS block (Blocks 1-3), caches its partition of messages (Caches 1-3), and returns results to the driver.]

• Full-text search of Wikipedia
– 60 GB on 20 EC2 machines
– 0.5 sec in memory vs. 20 sec for on-disk data
3 Spark Core
• Word Count Example
[Figure: word count dataflow: a text file is mapped to words, then to (word, count) tuples, reduced by key into counts, coalesced, and written with saveAsTextFile to produce the results.]
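A runnable PySpark sketch of the pipeline in the figure (the paths are placeholders; note that in code the splitting step uses flatMap rather than a plain map):

    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    counts = (sc.textFile("hdfs://...")               # Text file
                .flatMap(lambda line: line.split())   # words
                .map(lambda word: (word, 1))          # tuples
                .reduceByKey(lambda a, b: a + b)      # counts
                .coalesce(1))                         # merge partitions before writing
    counts.saveAsTextFile("hdfs://...")               # Results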
3 Spark Core
• Code: Hadoop vs. Spark (e.g., Word Count)
• Simpler, less code
• Multi-stage pipelines
• Operations (see the sketch below)
– Transformations: apply user code to distributed data in parallel
– Actions: assemble the final output from the distributed data
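A small sketch of the transformation/action split described above (the example data is made up): the transformations only describe the pipeline lazily, and nothing executes until the action runs.

    from pyspark import SparkContext

    sc = SparkContext(appName="TransformationsVsActions")

    rdd = sc.parallelize(["to be or not to be"])

    # Transformations: lazily build a multi-stage pipeline; no job runs yet
    pairs = (rdd.flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

    # Action: triggers execution and assembles the output on the driver
    print(pairs.collect())   # e.g. [('or', 1), ('not', 1), ('to', 2), ('be', 2)]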