

WordCount: Mapper
  • Data types of input key-value
    • Object can be replaced with LongWritable
  • Data types of output key-value
  • Key-value pairs with specified data types
WordCount: Reducer
  • Data types of input key-value
    • Should be the same as the output data types of the mapper
  • Data types of output key-value
  • A list of values
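The code shown on the mapper and reducer slides is not preserved in this text, so the following is only a minimal sketch of the same idea in plain Python (not the Hadoop Java API). The function names `wordcount_map` and `wordcount_reduce` are hypothetical. The type hints stand in for the Hadoop data types: in the standard Hadoop WordCount example the corresponding Java types are LongWritable (or Object), Text, and IntWritable, which is an assumption about the missing slide image.

```python
from typing import Iterable, Iterator, Tuple


def wordcount_map(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """Input pair: (byte offset, line of text). Output pairs: (word, 1)."""
    for word in value.split():
        yield (word, 1)


def wordcount_reduce(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Input pair: (word, list of counts). Output pair: (word, total count).

    The input types (str, int) match the mapper's output types.
    """
    yield (key, sum(values))


print(list(wordcount_map(0, "hello world")))
print(list(wordcount_reduce("hello", [1, 1, 1, 1])))
```

Note that the reducer's input key and value types are exactly the mapper's output key and value types, which is the constraint the reducer slide states.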
Checking map input
  map input: key=0, value=hello world
  map input: key=12, value=hello this world
  map input: key=29, value=hello hello world
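The keys 0, 12, and 29 are the byte offsets at which each line starts in the input file, which is what Hadoop's default text input format supplies as the map input key. A minimal plain-Python sketch that reproduces these keys, assuming the input is the three lines above separated by single newline characters (the helper `line_records` is hypothetical):

```python
def line_records(data: bytes):
    """Yield (byte offset of line start, line text without the newline)."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)  # advance by the full line length, newline included


data = b"hello world\nhello this world\nhello hello world\n"
for key, value in line_records(data):
    print(f"map input: key={key}, value={value}")
```

"hello world" plus its newline is 12 bytes, so the second line starts at offset 12; "hello this world" plus its newline is 17 bytes, so the third starts at 29.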
Checking reduce input
  reduce input: key=hello, values=1 1 1 1
  reduce input: key=this, values=1
  reduce input: key=world, values=1 1 1
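The reduce input above can be derived from the map input by emitting (word, 1) for every word and then grouping the pairs by key. A small plain-Python sketch of that grouping step, assuming a single reduce task receives all keys:

```python
from collections import defaultdict

lines = ["hello world", "hello this world", "hello hello world"]

# Map phase: one (word, 1) pair per word occurrence.
intermediate = [(word, 1) for line in lines for word in line.split()]

# Group by key: every value emitted for the same word ends up in one list.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

for key in sorted(groups):
    print(f"reduce input: key={key}, values={' '.join(map(str, groups[key]))}")
```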
Map and reduce tasks in Hadoop
  • A node may run multiple map/reduce tasks
  • Typically, one map task per input split (chunk of data)
  • One reduce task per partition of map output
    • E.g., partition by key range or hashing
Mapper and Reducer
  • Each map task runs an instance of Mapper
    • Mapper has a map function
    • Map task invokes the map function of the Mapper once for each input key-value pair
  • Each reduce task runs an instance of Reducer
    • Reducer has a reduce function
    • Reduce task invokes the reduce function of the Reducer once for every different intermediate key
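The relationship between tasks and functions can be illustrated with a toy driver. This is a plain-Python sketch, not Hadoop code: the classes and the helpers `run_map_task` and `run_reduce_task` are hypothetical, and the call counters exist only to show how often each function is invoked.

```python
from collections import defaultdict


class WordCountMapper:
    def __init__(self):
        self.calls = 0

    def map(self, key, value):
        self.calls += 1
        return [(word, 1) for word in value.split()]


class WordCountReducer:
    def __init__(self):
        self.calls = 0

    def reduce(self, key, values):
        self.calls += 1
        return (key, sum(values))


def run_map_task(mapper, records):
    """Invoke map once for each input key-value pair."""
    output = []
    for key, value in records:
        output.extend(mapper.map(key, value))
    return output


def run_reduce_task(reducer, pairs):
    """Invoke reduce once for every different intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [reducer.reduce(key, groups[key]) for key in sorted(groups)]


records = [(0, "hello world"), (12, "hello this world"), (29, "hello hello world")]
mapper, reducer = WordCountMapper(), WordCountReducer()
result = run_reduce_task(reducer, run_map_task(mapper, records))
print(mapper.calls, reducer.calls, result)
```

With three input lines and three distinct words, both counters end at 3, but for different reasons: map is driven by the number of input pairs, reduce by the number of distinct keys.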
Reduce function
  • Input: a key and an iterator over the values for the key
    • Values are NOT in any particular order
  • Reduce function is called once for every different key (received by the reduce task)
Roadmap
  • Hadoop architecture
    • HDFS
    • MapReduce
  • MapReduce implementation
    • Map & reduce functions and tasks
    • Shuffling & group by
    • Input and output format
    • Combiner
  • Compile & run MapReduce programs
Shuffling
  • The process of distributing intermediate key-value pairs to the right reduce tasks
  • It is the only communication between map and reduce tasks
    • Individual map tasks do not exchange data directly with other map tasks
    • They are not even aware of the existence of their peers
Shuffling
  • Begins when a map task completes on a node
  • All intermediate key-value pairs with the same key are sent to the same reduce task
    • Partitioning method is defined in the Partitioner class
    • Default rule: partition by hashing the key
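The default rule can be sketched as "hash of the key, modulo the number of reduce tasks". The plain-Python version below is an illustration only: Hadoop's default partitioner uses the Java `hashCode()` of the key, whereas this sketch substitutes `zlib.crc32` as a stand-in, because Python's built-in `hash()` for strings differs between processes. The function name `partition` is hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reduce_tasks: int) -> int:
    """Map a key to a reduce task index in [0, num_reduce_tasks)."""
    h = zlib.crc32(key.encode("utf-8"))  # stand-in for Java's hashCode()
    return (h & 0x7FFFFFFF) % num_reduce_tasks


pairs = [("hello", 1), ("world", 1), ("hello", 1), ("this", 1), ("world", 1)]
num_reduce_tasks = 2

# Route every intermediate pair to the reduce task chosen by its key.
by_task = defaultdict(list)
for key, value in pairs:
    by_task[partition(key, num_reduce_tasks)].append((key, value))

for task in sorted(by_task):
    print(task, by_task[task])
```

Because the partition depends only on the key, every pair with the same key lands at the same reduce task, no matter which map task produced it.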
Sorting
  • Key-value pairs from the same map task are sorted by key before they are sent to the reduce task
  • Each reduce task receives up to M files, each sorted by key
    • M = number of map tasks
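A small plain-Python sketch of the map-side sort, assuming two map tasks (M = 2) and a single reduce task so that partitioning can be ignored. The variable names are hypothetical; each inner list stands for the output of one map task.

```python
# Unsorted output of each map task, in the order the pairs were emitted.
map_outputs = [
    [("hello", 1), ("world", 1), ("hello", 1), ("this", 1)],   # map task 0
    [("world", 1), ("hello", 1), ("hello", 1), ("world", 1)],  # map task 1
]

# Each map task sorts its own output by key before it is shipped.
sorted_runs = [sorted(output, key=lambda kv: kv[0]) for output in map_outputs]

# The reduce task receives one sorted run per map task: up to M runs.
for task_id, run in enumerate(sorted_runs):
    print(f"from map task {task_id}: {run}")
```

Each run is sorted on its own, but the runs are not yet sorted relative to each other; that is the job of the merge step.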
Merging
  • The system merges the sorted files
    • Recall merge-sort in external sorting
    • Difference: sorting and merging are now done in parallel
  • Shuffling and merging happen simultaneously
    • A new file is merged as soon as it is fetched, without waiting for the other files to be fetched
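The idea of merging each sorted file as soon as it arrives can be sketched with Python's `heapq.merge`. This is a simplified illustration, not how Hadoop schedules its merges: the helper `merge_as_fetched` is hypothetical, and it folds in one run at a time.

```python
import heapq


def merge_as_fetched(runs):
    """Merge key-sorted runs one by one, in the order they are fetched."""
    merged = []
    for run in runs:  # each run becomes available at a different time
        merged = list(heapq.merge(merged, run, key=lambda kv: kv[0]))
    return merged


run_from_map_0 = [("hello", 1), ("hello", 1), ("this", 1), ("world", 1)]
run_from_map_1 = [("hello", 1), ("hello", 1), ("world", 1), ("world", 1)]

merged = merge_as_fetched([run_from_map_0, run_from_map_1])
print(merged)
```

Because every input run is already sorted by key, each merge is a linear pass, as in the merge phase of external merge-sort. After the final merge, all values for a key are adjacent and can be handed to the reduce function together.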
Partitioning, shuffling & merging
  • Keys in the same partition are sorted (keys from different partitions may not be)
  • Merging by partition (& then by key)
  • Merging by key
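One way to read "by partition, then by key" is as a sort on the composite key (partition, key). The plain-Python sketch below rests on assumptions: the `partition` function uses `zlib.crc32` as a stand-in for Hadoop's key hashing, two reduce tasks are assumed, and the helper `sort_by_partition_then_key` is hypothetical.

```python
import zlib
from itertools import groupby

NUM_REDUCE_TASKS = 2  # assumed for illustration


def partition(key: str, n: int = NUM_REDUCE_TASKS) -> int:
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % n


def sort_by_partition_then_key(pairs):
    """Order pairs by partition first, then by key within each partition."""
    return sorted(pairs, key=lambda kv: (partition(kv[0]), kv[0]))


pairs = [("world", 1), ("hello", 1), ("this", 1), ("hello", 1), ("apple", 1)]
ordered = sort_by_partition_then_key(pairs)

# Cut the ordered output into one contiguous segment per partition.
segments = {
    p: list(group)
    for p, group in groupby(ordered, key=lambda kv: partition(kv[0]))
}
for p, segment in segments.items():
    print(f"partition {p}: {segment}")
```

Within each segment the keys are sorted, but the concatenation of all segments is generally not sorted by key, which matches the first bullet above. A reduce task that receives only its own segment from every map task then needs to merge by key alone.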
  • Fall '14