
sc.textFile("ratings.txt") \
    .map(lambda s: s[1:-1].split(",")) \
    .map(lambda p: (p[0], int(p[1]))) \
    .reduceByKey(lambda a, b: a + b) \
    .collect()
=> [(u'patrick', 5), (u'aaron', 9), (u'reynold', 1), (u'matei', 3)]
Execution steps
- Strip off '(' and ')'
- Tokenize by ','
- Turn the values into integers
- Note that reduceByKey requires shuffling
Roadmap
- Spark: history, features, RDD, and installation
- RDD operations
  - Creating initial RDDs
  - Actions
  - Transformations
  - Examples
- Shuffling in Spark
- Persistence in Spark
Shuffling
- Data are essentially repartitioned
  - E.g., reduceByKey repartitions the data by key (see the sketch below)
- A costly operation: a lot of local & network I/O
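To make the repartitioning visible, here is a minimal sketch (assuming an interactive PySpark shell where sc is already defined); glom() lists the contents of each partition:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)], 2)
pairs.glom().collect()
# e.g., [[('a', 1), ('b', 1)], [('a', 2), ('b', 3)]]
# values for the same key sit in different partitions
pairs.reduceByKey(lambda a, b: a + b).glom().collect()
# after the shuffle, all values for a key have been moved into a single partition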
Another example: sortByKey
- Sampling stage: sample the data to create a range partitioner
  - Ensures even partitioning
- "Map" stage: write (sorted) data to the destined partition for the reduce stage
- "Reduce" stage: fetch the map output for a specific partition
  - Merge the sorted data
- Data are shuffled between the map and reduce stages
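A minimal usage sketch (again assuming a PySpark shell with sc):

sc.parallelize([("b", 2), ("c", 3), ("a", 1)]) \
    .sortByKey() \
    .collect()
=> [('a', 1), ('b', 2), ('c', 3)]
# sortByKey(False) would sort in descending order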
Transformations that require shuffling
- reduceByKey(func)
- groupByKey()
- sortByKey([asc])
- distinct()
Transformations that require shuffling (cont'd)
- join(rdd)
  - leftOuterJoin / rightOuterJoin / fullOuterJoin
- aggregateByKey(zeroValue, seqOp, combOp)
- intersection / subtract
- subtractByKey
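For example, join must co-locate all pairs that share a key, hence the shuffle; a short sketch with toy data (assumes a PySpark shell with sc):

x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
x.join(y).collect()
=> [('a', (1, 2)), ('a', (1, 3))]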
Transformations that do not need shuffling
- map(func)
- filter(func)
- flatMap(func)
- mapValues(func)
- union
- mapPartitions(func)
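These operate on each partition independently, so no data movement is needed; a quick sketch (assumes a PySpark shell with sc):

rdd = sc.parallelize([("a", [1, 2]), ("b", [3])])
rdd.mapValues(len).collect()
# [('a', 2), ('b', 1)] -- keys (and thus the partitioning) are untouched
rdd.flatMap(lambda kv: kv[1]).collect()
# [1, 2, 3]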
Roadmap
- Spark: history, features, RDD, and installation
- RDD operations
  - Creating initial RDDs
  - Actions
  - Transformations
  - Examples
- Shuffling in Spark
- Persistence in Spark
RDD persistence
- rdd.persist(<storageLevel>)
  - Stores the content of the RDD for later reuse
  - storageLevel specifies where the content is stored
    - E.g., in memory (default) or on disk
- rdd.persist() or rdd.cache()
  - Content stored in main memory
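A minimal sketch of choosing a non-default storage level (assumes a PySpark shell with sc; ratings.txt is the file from the earlier example):

from pyspark import StorageLevel

lines = sc.textFile("ratings.txt")
lines.persist(StorageLevel.MEMORY_AND_DISK)
# keeps partitions in memory and spills to disk those that do not fit;
# cache() is shorthand for persist() with the default MEMORY_ONLY level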
RDD persistence
- Executed at the nodes that hold partitions of the RDD
- Avoids re-computation of the RDD when it is reused
Example

ratings = sc.textFile("ratings.txt") \
    .map(lambda s: s[1:-1].split(",")) \
    .map(lambda p: (p[0], int(p[1]))) \
    .cache()

ratings.reduceByKey(lambda a, b: a + b).collect()

- The ratings RDD is computed the first time it is used, and the result is cached
Example

ratings.countByKey()

- It will use the cached content of the ratings RDD
- countByKey() is an action that returns, for each key, the number of elements with that key
Automatic persistence
- Spark automatically persists intermediate data in shuffling operations (e.g., reduceByKey)
- This avoids re-computation when a node fails
K-means clustering
- Find k clusters in a data set
  - k is pre-determined
- Iterative process
  - Start with an initial guess of the cluster centers
  - Repeatedly refine the guess until stable (e.g., the centers do not change much)
- The data set is needed at each iteration
K-means clustering
- Assign each point p to the closest center c
  - Distance = Euclidean distance between p and c
- Re-compute the centers based on the assignments
  - Coordinates of a cluster's center = the average coordinates of all points in the cluster
  - E.g., (1, 1, 1) and (3, 3, 3) => center: (2, 2, 2)
K-means clustering
[Figure: points in the x-y plane, with the cluster centers converging over iterations 1-6]
K-means in Spark
- Persist the data points in memory
- Start from initial centers
- Iterate until the sum of distances between the new and old centers is small (see the sketch below)
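The slide's code did not survive extraction; below is a minimal sketch matching the annotations above, closely following the standard PySpark k-means example (the file name points.txt and the values of K and convergeDist are hypothetical):

import numpy as np

def closest_center(p, centers):
    # index of the center nearest to point p (squared Euclidean distance)
    return min(range(len(centers)), key=lambda i: np.sum((p - centers[i]) ** 2))

# persist the data points in memory: they are reused at every iteration
points = sc.textFile("points.txt") \
    .map(lambda line: np.array([float(x) for x in line.split()])) \
    .cache()

K = 2                                  # pre-determined number of clusters
convergeDist = 0.1                     # stop once the centers barely move
centers = points.takeSample(False, K, seed=1)  # initial guess of the centers

tempDist = float("inf")
while tempDist > convergeDist:
    # assign each point to its closest center
    closest = points.map(lambda p: (closest_center(p, centers), (p, 1)))
    # per cluster: sum of the points and their count (requires shuffling)
    stats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    # new center = average coordinates of all points in the cluster
    new_centers = stats.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])).collect()
    # sum of (squared) distances between the new and old centers
    tempDist = sum(np.sum((centers[i] - c) ** 2) for (i, c) in new_centers)
    for (i, c) in new_centers:
        centers[i] = c

print(centers)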