Data Locality and Caching
• Same as for Hadoop MapReduce: avoid network I/O; workers should manage local data
• First run: data is not in cache, so use the HadoopRDD's locality preferences (from HDFS)
• Second run: the FilteredRDD is in cache, so use its locations
• If something falls out of cache, go back to HDFS
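The eviction fallback can be made explicit with persist(); cache() is shorthand for persisting at the MEMORY_ONLY level. A minimal sketch, assuming the `messages` RDD from the example below:

from pyspark import StorageLevel

messages.persist(StorageLevel.MEMORY_ONLY)   # equivalent to messages.cache()
# Under memory pressure a cached partition may be evicted; Spark then
# transparently recomputes it from its lineage (here, re-reading the HDFS block).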

lines = spark.textFile("hdfs://...")                    # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()  # Action
messages.filter(lambda s: "php" in s).count()
. . .

[Diagram: the Driver dispatches tasks to three Workers and collects results; each Worker scans its local HDFS block (Block 1–3) and keeps the filtered messages in an in-memory cache (Cache 1–3).]
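Stitched together, the example runs end to end as sketched below. The HDFS URL, the app name, and the tab-separated log layout (message in field 2) are assumptions for illustration, not from the slides:

from pyspark import SparkContext

sc = SparkContext(appName="LogMining")

# Build the lineage; nothing executes until an action is called.
lines = sc.textFile("hdfs://namenode:9000/logs/app.log")   # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))     # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()                                           # keep in memory after first use

# The first action reads from HDFS and populates the workers' caches...
print(messages.filter(lambda s: "mysql" in s).count())
# ...subsequent actions over `messages` are served from the in-memory caches.
print(messages.filter(lambda s: "php" in s).count())

sc.stop()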
Full-text search of Wikipedia
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 sec for on-disk data
3 Spark Core
• Programming in Spark: Log mining example
  – Load error messages from a log into memory, then interactively search for various patterns

3 Spark Core
• Word Count Example
[Diagram: Text file → (map) → words → (map) → tuples → (reduceByKey) → counts → (coalesce, saveAsTextFile) → Results.]
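A minimal PySpark sketch matching the diagram's stages; the input/output HDFS paths are assumptions, and the first "map" stage in the diagram corresponds to flatMap here (one line yields many words):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

text = sc.textFile("hdfs://namenode:9000/input.txt")           # Text file
words = text.flatMap(lambda line: line.split())                # words
tuples = words.map(lambda w: (w, 1))                           # tuples
counts = tuples.reduceByKey(lambda a, b: a + b)                # counts (shuffle stage)
counts.coalesce(1).saveAsTextFile("hdfs://namenode:9000/out")  # Results

sc.stop()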

66
•
Simple/Less code
•
Multiple stages
pipeline
•
Operations
Transformations: apply user code to
distribute data in parallel
Actions: assemble final output from
distributed data
Code: Hadoop vs Spark (e.g.,
Word Count)
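The transformation/action split is visible in even the smallest job: transformations are lazy and only extend the lineage, while an action triggers execution. A sketch, assuming an existing SparkContext `sc`:

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)          # transformation: lazy, no job runs yet
total = squares.reduce(lambda a, b: a + b)   # action: triggers execution, returns 30
print(total)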
