
Example

ratings = sc.textFile("ratings.txt") \
          .map(lambda s: s[1:-1].split(",")) \
          .map(lambda p: (p[0], int(p[1]))) \
          .cache()
ratings.reduceByKey(lambda a, b: a + b).collect()

The "ratings" RDD will be computed for the first time and its result cached.
Example

ratings.countByKey()

This will use the cached content of the "ratings" RDD.
Automatic persistence

Spark automatically persists intermediate data in shuffle operations (e.g., reduceByKey). This avoids re-computation when a node fails.
K-means clustering

Find k clusters in a data set; k is pre-determined. It is an iterative process: start with an initial guess of the cluster centers, then repeatedly refine the guess until it is stable (e.g., the centers do not change much). The data set is needed at each iteration.
K-means clustering

Assign each point p to the closest center c, where distance = Euclidean distance between p and c. Then re-compute the centers based on the assignments: the coordinates of a cluster's center are the average coordinates of all points in the cluster. E.g., (1, 1, 1) and (3, 3, 3) => center (2, 2, 2).
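These two steps can be illustrated with plain NumPy (the point and center values in the assignment step below are made up for illustration; the (1, 1, 1)/(3, 3, 3) averaging example is the one from this slide):

import numpy as np

# Assignment step: pick the center with the smallest Euclidean distance to p
p = np.array([1.0, 1.0, 1.0])
centers = [np.array([0.0, 0.0, 0.0]), np.array([4.0, 4.0, 4.0])]
closest = int(np.argmin([np.sqrt(np.sum((p - c) ** 2)) for c in centers]))  # -> 0

# Update step: a cluster's new center is the average of its points
cluster_points = np.array([[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]])
new_center = cluster_points.mean(axis=0)  # -> array([2., 2., 2.])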
[Figure: K-means clustering converging over iterations 1-6 (scatter plot, x vs. y)]
The complete K-means driver (annotated on the slide): persist the data points in memory, pick initial centers, then repeatedly compute new centers and the sum of distances between the new and old centers.
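The driver code on this slide appears only as an image in the preview. A sketch of the whole loop, in the spirit of Spark's bundled kmeans.py example (names such as convergeDist are assumptions, not the slide's exact code; the helpers parseVector and closestPoint, sketched after the next slide, are assumed to be in scope):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="KMeans")

# Persist data points in memory (input file described on a later slide)
data = sc.textFile("kmeans-data.txt").map(parseVector).cache()

K = 2
convergeDist = 0.1                       # assumed convergence threshold
kPoints = data.takeSample(False, K, 1)   # initial centers
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()  # new centers
    # Sum of distances between new and old centers
    tempDist = sum(np.sum((kPoints[i] - p) ** 2) for (i, p) in newPoints)
    for (i, p) in newPoints:
        kPoints[i] = p

print("Final centers: " + str(kPoints))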
Parse input & find closest center
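The code in this slide's image is not captured by the preview; a sketch of these two helpers along the lines of Spark's bundled kmeans.py example (the names parseVector and closestPoint come from that example, not from the slide text):

import numpy as np

def parseVector(line):
    # Parse one input line, e.g. "0.0 0.0 0.0", into a NumPy vector
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    # Return the index of the center with the smallest squared Euclidean distance to p
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex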
kmeans-data.txt

A text file containing the following lines:
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
Each line is a 3-dimensional data point.
Parse & cache the input dataset

The "data" RDD is now cached in main memory.
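In terms of the sketch after slide 123, this step is (the RDD name "data" is from the slide; "lines" is a hypothetical intermediate name):

lines = sc.textFile("kmeans-data.txt")
data = lines.map(parseVector).cache()   # "data" RDD is now cached in main memory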
Generating initial centers

Recall the takeSample() action. False means sample without replacement; K = 2.
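Continuing the sketch, the initial centers might be drawn like this (the seed value 1 is an assumption; the sampled centers shown in the comment are the ones the following slides use):

K = 2
kPoints = data.takeSample(False, K, 1)   # False = sample without replacement
# e.g. kPoints == [array([0.1, 0.1, 0.1]), array([0.2, 0.2, 0.2])]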
Assign point to its closest center

Center 0 has points (0, 0, 0) and (.1, .1, .1). Center 1 has the rest: (.2, .2, .2), (9, 9, 9), …
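The corresponding assignment step from the sketch, with this slide's result shown as comments:

closest = data.map(lambda p: (closestPoint(p, kPoints), (p, 1)))
# With centers [.1, .1, .1] (index 0) and [.2, .2, .2] (index 1):
#   center 0 <- (0, 0, 0), (.1, .1, .1)
#   center 1 <- (.2, .2, .2), (9, 9, 9), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2)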
Getting statistics for each center

pointStats has a key-value pair for each center. The key is the center # (0 or 1 for this example). The value is a tuple (sum, count), where sum = the sum of coordinates over all points in the cluster and count = # of points in the cluster.
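The corresponding step from the sketch, with this example's values as comments:

pointStats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
# One (sum, count) pair per center, e.g.:
#   (0, (array([0.1, 0.1, 0.1]), 2))
#   (1, (array([27.5, 27.5, 27.5]), 4))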
Computing coordinates of new centers

Coordinate = sum of point coordinates / count. E.g., center 0: [.1, .1, .1] / 2 = [.05, .05, .05].
Can use mapValues here too:
newPoints1 = pointStats.mapValues(lambda stv: stv[0]/stv[1]).collect()
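The equivalent map-based version used in the sketch ("newPoints" is the name used there; both forms give the same result):

newPoints = pointStats.map(lambda st: (st[0], st[1][0] / st[1][1])).collect()
# -> [(0, array([0.05, 0.05, 0.05])), (1, array([6.875, 6.875, 6.875]))]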
Distance btw new & old centers

Old centers: [.1, .1, .1] and [.2, .2, .2]. New centers: [.05, .05, .05] and [6.875, 6.875, 6.875].
Distance = (.1 - .05)^2 * 3 + (6.875 - .2)^2 * 3 ≈ 133.67
To be more exact, it is sqrt(133.67) ≈ 11.56.
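In the sketch, this is the convergence test and the center update (the 133.67 figure is this slide's sum of squared per-coordinate differences):

tempDist = sum(np.sum((kPoints[i] - p) ** 2) for (i, p) in newPoints)
# = (.1 - .05)^2 * 3 + (6.875 - .2)^2 * 3 ≈ 133.67
for (i, p) in newPoints:
    kPoints[i] = p   # new centers become the old centers for the next iteration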
RDD operations

A complete list: yspark.html
Resources

Spark programming guide:
Lambda, filter, reduce and map:
Improving Sort Performance in Apache Spark: It's a Double - sort-performance-in-apache-spark-its-a-double/
Readings

Spark: Cluster Computing with Working Sets, 2010.