
Data Locality and Caching
Same as for Hadoop MapReduce: avoid network I/O; workers should operate on local data.
• First run: the data is not in cache, so use the HadoopRDD's locality preferences (from HDFS).
• Second run: the FilteredRDD is in cache, so use its locations.
• If a partition falls out of cache, go back to HDFS.
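The steps above can be sketched as a small placement function. This is a simplified illustration, not Spark's actual scheduler code; the function and variable names (`preferred_workers`, `cache_locations`, `hdfs_block_locations`) are invented for the example.

```python
# Sketch of cache-aware preferred-location logic (simplified, hypothetical
# names; the real decision lives inside Spark's scheduler).
def preferred_workers(partition, cache_locations, hdfs_block_locations):
    """Prefer workers that hold the partition in cache; otherwise fall
    back to the HDFS block locations of the base HadoopRDD."""
    if partition in cache_locations:           # second run: data is cached
        return cache_locations[partition]
    return hdfs_block_locations[partition]     # first run / evicted: HDFS

# First run: nothing is cached yet, so schedule by HDFS block placement.
hdfs = {0: ["worker-1"], 1: ["worker-2"], 2: ["worker-3"]}
assert preferred_workers(0, {}, hdfs) == ["worker-1"]

# Second run: partition 0 was cached on worker-2, so prefer it.
cache = {0: ["worker-2"]}
assert preferred_workers(0, cache, hdfs) == ["worker-2"]

# Partition 1 fell out of cache: go back to its HDFS locations.
assert preferred_workers(1, cache, hdfs) == ["worker-2"]
```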
3 Spark Core
Programming in Spark: Log mining example
Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "mysql" in s).count()
    messages.filter(lambda s: "php" in s).count()

lines is the base RDD; errors and messages are transformed RDDs; count() is an action. The driver ships tasks to the workers (reading Block 1/2/3) and collects results; after cache(), the messages partitions stay in memory on the workers (Cache 1/2/3).
Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 s from cache vs. 20 s on disk.
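The key point of the example is that filter and map are lazy: they only build a lineage, and nothing runs until an action like count(). A minimal pure-Python emulation of that behavior (the MiniRDD class is invented for illustration and is not Spark's API):

```python
# Toy emulation of lazy RDD transformations and an eager action.
class MiniRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn            # thunk: nothing runs yet

    def filter(self, pred):                # transformation: stays lazy
        return MiniRDD(lambda: (x for x in self._data_fn() if pred(x)))

    def map(self, fn):                     # transformation: stays lazy
        return MiniRDD(lambda: (fn(x) for x in self._data_fn()))

    def cache(self):
        materialized = list(self._data_fn())   # keep results in memory
        self._data_fn = lambda: iter(materialized)
        return self

    def count(self):                       # action: forces evaluation
        return sum(1 for _ in self._data_fn())

log = ["ERROR\tdb\tmysql down", "INFO\tweb\tok", "ERROR\tweb\tphp fatal"]
lines = MiniRDD(lambda: iter(log))
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2]).cache()

print(messages.filter(lambda s: "mysql" in s).count())  # 1
print(messages.filter(lambda s: "php" in s).count())    # 1
```

After cache(), repeated queries over messages reread the in-memory list instead of re-running the filter/map chain, mirroring the interactive-search pattern on the slide.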
3 Spark Core
Word Count Example
[Slide diagram] Dataflow: text file → map → words → map → tuples → reduceByKey → counts → coalesce → saveAsTextFile → results
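The diagram's map and reduceByKey stages can be emulated in plain Python. This is a sketch of the dataflow only; reduce_by_key here stands in for Spark's distributed shuffle-and-fold, and the input lines are made up for the example.

```python
# Word-count dataflow from the slide, emulated without a cluster.
def reduce_by_key(pairs, fn):
    """Group (key, value) tuples by key and fold the values with fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return sorted(acc.items())

text = ["to be or", "not to be"]                     # text file
words = [w for line in text for w in line.split()]   # map: line -> words
tuples = [(w, 1) for w in words]                     # map: word -> (word, 1)
counts = reduce_by_key(tuples, lambda a, b: a + b)   # reduceByKey: sum

print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In real Spark the same pipeline is `textFile(...).flatMap(split).map(lambda w: (w, 1)).reduceByKey(add)`, with coalesce and saveAsTextFile writing the result.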
Code: Hadoop vs. Spark (e.g., Word Count)
Spark needs simpler, less code and expresses multi-stage pipelines directly.
Operations:
• Transformations: apply user code to distributed data in parallel
• Actions: assemble the final output from the distributed data