Example
•
ratings = sc.textFile("ratings.txt") \
          .map(lambda s: s.split(",")) \
          .map(lambda p: (p[0], int(p[1]))) \
          .cache()
•
ratings.reduceByKey(lambda a, b: a + b).collect()
–
The ratings RDD is computed for the first time and the result is cached
Example
•
ratings.countByKey()
–
It will use the cached content of the "ratings" RDD
Automatic persistence
•
Spark automatically persists intermediate data in shuffling operations (e.g., reduceByKey)
•
This avoids recomputation when a node fails
Kmeans clustering
•
Find k clusters in a data set
–
k is predetermined
•
Iterative process
–
Start with initial guess of centers of clusters
–
Repeatedly refine the guess until stable (e.g., centers do not change much)
•
Need to use the data set at each iteration
Kmeans clustering
•
Assign point p to the closest center c
–
Distance = Euclidean distance between p and c
•
Recompute the centers based on assignments
•
Coordinates of center of a cluster =
–
Average coordinate of all points in the cluster
–
E.g., (1, 1, 1) (3, 3, 3) => center: (2, 2, 2)
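The averaging rule above can be checked with a quick NumPy sketch, using the example values from the slide:

```python
import numpy as np

# Center of a cluster = element-wise average of all its points
points = [np.array([1.0, 1.0, 1.0]), np.array([3.0, 3.0, 3.0])]
center = sum(points) / len(points)
print(center)  # [2. 2. 2.]
```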
Kmeans clustering
[Figure: scatter plots of the data points on x-y axes, showing the cluster centers converging over Iterations 1-6]
[Code screenshot: the Kmeans driver program, with these annotations]
–
Persist data points in memory
–
Initial centers
–
Parse input & find closest center
–
New centers
–
Sum of distances between new and old centers
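The driver code itself is a screenshot that did not survive extraction. The sketch below reconstructs it from the slide's annotations, in the style of the classic PySpark k-means example; the function names, K = 2, and the convergence threshold are assumptions, not the original code:

```python
import numpy as np

def parseVector(line):
    # "0.1 0.1 0.1" -> array([0.1, 0.1, 0.1])
    return np.array([float(x) for x in line.split()])

def closestPoint(p, centers):
    # Index of the center with the smallest squared Euclidean distance to p
    return min(range(len(centers)),
               key=lambda i: np.sum((p - centers[i]) ** 2))

def kmeans_driver(sc, path="kmeansdata.txt", K=2, convergeDist=0.01):
    # Persist data points in memory
    data = sc.textFile(path).map(parseVector).cache()
    # Initial centers: K points sampled without replacement
    kPoints = data.takeSample(False, K)
    tempDist = float("inf")
    while tempDist > convergeDist:
        # Parse input & find closest center
        closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
        pointStats = closest.reduceByKey(
            lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # New centers = coordinate sum / point count
        newPoints = pointStats.mapValues(lambda st: st[0] / st[1]).collect()
        # Sum of squared distances between new and old centers
        tempDist = sum(np.sum((kPoints[i] - p) ** 2) for i, p in newPoints)
        for i, p in newPoints:
            kPoints[i] = p
    return kPoints
```

Call kmeans_driver(sc) with a live SparkContext; the two helper functions run without Spark.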
kmeansdata.txt
•
A text file contains the following lines
–
0.0 0.0 0.0
–
0.1 0.1 0.1
–
0.2 0.2 0.2
–
9.0 9.0 9.0
–
9.1 9.1 9.1
–
9.2 9.2 9.2
•
Each line is a 3-dimensional data point
Parse & cache the input dataset
•
"data" RDD is now cached in main memory
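The parsing code on this slide is a screenshot lost in extraction; a plausible reconstruction (the name parseVector is an assumption) is:

```python
import numpy as np

def parseVector(line):
    # "0.1 0.1 0.1" -> array([0.1, 0.1, 0.1])
    return np.array([float(x) for x in line.split()])

# With a SparkContext `sc`, the cached "data" RDD would be built as:
#   data = sc.textFile("kmeansdata.txt").map(parseVector).cache()
print(parseVector("0.1 0.1 0.1"))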
Generating initial centers
•
Recall takeSample() action
–
False: sample without replacement
–
K = 2
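takeSample(False, 2) returns 2 distinct points from the RDD. Its behavior can be mimicked on a plain Python list (a sketch of the semantics, not Spark itself):

```python
import random

points = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
          [9.0, 9.0, 9.0], [9.1, 9.1, 9.1], [9.2, 9.2, 9.2]]
# Analogue of data.takeSample(False, 2): K = 2, without replacement
kPoints = random.sample(points, 2)
print(kPoints)  # two distinct points from the data set
```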
Assign point to its closest center
•
Center 0 has points: (0, 0, 0) and (.1, .1, .1)
•
Center 1 has the rest: (.2, .2, .2), (9.0, 9.0, 9.0), …
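The assignment step can be sketched with a small helper (closestPoint is an assumed name):

```python
import numpy as np

def closestPoint(p, centers):
    # Index of the center with the smallest squared Euclidean distance to p
    return min(range(len(centers)),
               key=lambda i: np.sum((p - centers[i]) ** 2))

centers = [np.array([0.1, 0.1, 0.1]), np.array([0.2, 0.2, 0.2])]
print(closestPoint(np.array([0.0, 0.0, 0.0]), centers))  # 0
# In Spark, each point would be keyed by its closest center:
#   closest = data.map(lambda p: (closestPoint(p, centers), (p, 1)))
```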
Getting statistics for each center
•
pointStats has a key-value pair for each center
•
Key is center # (0 or 1 for this example)
•
Value is a tuple (sum, count)
–
sum = the sum of coordinates over all points in the cluster
–
count = # of points in the cluster
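The (sum, count) pairs correspond to a reduceByKey over (centerIndex, (point, 1)) records. Here is a plain-Python simulation of that reduce on the example data (a sketch, not Spark itself):

```python
import numpy as np

# (centerIndex, (point, 1)) pairs, as produced by the assignment step
pairs = [(0, (np.array([0.0, 0.0, 0.0]), 1)),
         (0, (np.array([0.1, 0.1, 0.1]), 1)),
         (1, (np.array([0.2, 0.2, 0.2]), 1)),
         (1, (np.array([9.0, 9.0, 9.0]), 1))]

# Simulates reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
stats = {}
for k, (p, n) in pairs:
    s, c = stats.get(k, (np.zeros(3), 0))
    stats[k] = (s + p, c + n)
print(stats)  # center -> (coordinate sum, point count)
```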
Computing coordinates of new centers
•
Coordinate = sum of point coordinates/count
–
E.g., center 0: [.1, .1, .1] /2 = [.05, .05, .05]
Can use mapValues here too:
newPoints1 = pointStats.mapValues(lambda stv: stv[0]/stv[1]).collect()
Distance between new & old centers
•
Old center: [.1, .1, .1] and [.2, .2, .2]
•
New center: [.05, .05, .05] and [6.875, 6.875, 6.875]
•
Distance = (.1 - .05)^2 * 3 + (6.875 - .2)^2 * 3 ~ 133.67
–
To be more exact, it is sqrt(133.67) = 11.56
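The arithmetic on this slide can be verified directly:

```python
import numpy as np

old = [np.array([0.1, 0.1, 0.1]), np.array([0.2, 0.2, 0.2])]
new = [np.array([0.05, 0.05, 0.05]), np.array([6.875, 6.875, 6.875])]
# Sum of squared coordinate differences between old and new centers
tempDist = sum(np.sum((o - n) ** 2) for o, n in zip(old, new))
print(round(float(tempDist), 2))           # 133.67
print(round(float(np.sqrt(tempDist)), 2))  # 11.56
```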
RDD operations
•
A complete list:
–
yspark.html
Resources
•
Spark programming guide:
–
•
Lambda, filter, reduce and map:
–
•
Improving Sort Performance in Apache Spark: It's a Double
–
sortperformanceinapachesparkitsadouble/
Readings
•
Spark: Cluster Computing with Working Sets, 2010