    Writable entry =
      MapFileOutputFormat.getEntry(readers, partitioner, key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    parser.parse(val.toString());
    System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordByTemperature(), args);
    System.exit(exitCode);
  }
}

The getReaders() method opens a MapFile.Reader for each of the output files created by the MapReduce job. The getEntry() method then uses the partitioner to choose the reader for the key, and finds the value for that key by calling the Reader's get() method. If getEntry() returns null, no matching key was found; otherwise, it returns the value, which we translate into a station ID and year.
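Under the hood, getEntry() is a thin convenience wrapper: it asks the partitioner which partition the key was written to, indexes into the readers array, and delegates to that reader's get() method. The following is a minimal sketch of the equivalent logic, for illustration only (the GetEntrySketch class is ours; the real implementation lives in Hadoop's MapFileOutputFormat):

import java.io.IOException;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

public class GetEntrySketch {
  // Illustrative re-creation of MapFileOutputFormat.getEntry(); not the
  // actual Hadoop source.
  public static <K extends WritableComparable<?>, V extends Writable>
      Writable getEntry(MapFile.Reader[] readers,
                        Partitioner<K, V> partitioner,
                        K key, V value) throws IOException {
    // The readers array is ordered by partition, so the partitioner tells
    // us exactly which MapFile holds the key...
    int partition = partitioner.getPartition(key, value, readers.length);
    // ...and the MapFile's index makes the lookup itself cheap.
    return readers[partition].get(key, value);
  }
}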
To see this in action, let's find the first entry for a temperature of –10°C (remember that temperatures are stored as integers representing tenths of a degree, which is why we ask for a temperature of –100):

% hadoop jar hadoop-examples.jar LookupRecordByTemperature output-hashmapsort -100
357460-99999    1956

We can also use the readers directly, in order to get all the records for a given key. The array of readers that is returned is ordered by partition, so the reader for a given key may be found using the same partitioner that was used in the MapReduce job:

Example 8-7. Retrieve all entries with a given key from a collection of MapFiles

public class LookupRecordsByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
      new HashPartitioner<IntWritable, Text>();
    Text val = new Text();

    Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
    Writable entry = reader.get(key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    IntWritable nextKey = new IntWritable();
    do {
      parser.parse(val.toString());
      System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    } while (reader.next(nextKey, val) && key.equals(nextKey));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
    System.exit(exitCode);
  }
}

And here is a sample run to retrieve all readings of –10°C and count them:

% hadoop jar hadoop-examples.jar LookupRecordsByTemperature output-hashmapsort -100 \
  2> /dev/null | wc -l
1489272
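Because getReaders() returns one reader per partition, in partition order, you can also walk the readers yourself to scan the whole dataset. Here is a minimal sketch of that pattern (the MapFileScanner class is our own illustration, not part of the book's example code). Note that with a HashPartitioner the output is sorted within each partition, but the concatenation of partitions is not globally sorted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileScanner {
  // Counts every entry across all partitions of a MapFile output
  // directory, given the readers from MapFileOutputFormat.getReaders().
  public static long countAllEntries(MapFile.Reader[] readers)
      throws IOException {
    long count = 0;
    IntWritable key = new IntWritable();
    Text val = new Text();
    for (MapFile.Reader reader : readers) { // readers are in partition order
      while (reader.next(key, val)) {       // key order within the partition
        count++;
      }
      reader.close();                       // release file handles
    }
    return count;
  }
}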
Total Sort

How can you produce a globally sorted file using Hadoop? The naive answer is to use a single partition. But this is incredibly inefficient for large files, since one machine has to process all of the output, so you are throwing away the benefits of the parallel architecture that MapReduce provides.
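For comparison, the naive approach amounts to forcing a single reduce task, as in this minimal driver sketch (the class name, job name, and Hadoop 2 Job.getInstance() call are our assumptions; input/output setup is elided as in earlier examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NaiveTotalSortDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "naive total sort");
    // One reduce task means one partition, so the single output file is
    // globally sorted by key, at the cost of funnelling every record
    // through one machine.
    job.setNumReduceTasks(1);
    // Input/output paths and formats would be configured here, as in
    // earlier examples.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}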