CLOUDWICK DEVELOPER LABS - www.cloudwick.com

HADOOP DEVELOPER LABS

Table of Contents

LAB 1: Hands-on Hadoop File System Navigation
LAB 2: Analyze Data with MapReduce - Word Count
LAB 3: Hadoop Streaming
LAB 4: MapReduce Using Local Job Runner
LAB 5: MapReduce Custom Partitioner
LAB 6: MapReduce Unit Tests
LAB 7: MapReduce Counters
LAB 8: Hive Queries
LAB 9: Pig Queries
LAB 10: Oozie Workflows
LAB 11: Sqoop Queries

Copyright ©2012 Cloudwick Technologies | 3226 Diablo Ave, Hayward, CA 94545 | email: [email protected]

LAB 1: Hands-on Hadoop File System Navigation

Hands-on Hadoop File System Navigation:
· Creating a directory in HDFS
· Writing files to HDFS from the local file system
· Changing permissions of a directory
· Reading files from HDFS

Creating a directory in HDFS:
The command below creates a directory in HDFS.

mkdir
Usage: hadoop dfs -mkdir <paths>
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir hdfs://host2:port2/user/hadoop/dir
Here, the directories "dir1" and "dir2" are created.

Writing files to HDFS from the local file system

put
Usage: hadoop dfs -put <localsrc> ... <dst>
Copies a single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and writes to the destination file system.
• hadoop dfs -put /root/abc.txt /user/cloudwick/
• hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
• hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
• hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile (reads the input from stdin)

copyFromLocal
Usage: hadoop fs -copyFromLocal /root/abc.txt /user/cloudwick/
Similar to the put command, except that the source is restricted to a local file reference.

Changing permissions of a directory

chmod
Usage: hadoop fs -chmod [-R] <MODE> <path>
Example: hadoop fs -chmod 755 /user/cloudwick/abc.txt
Changes the permissions of files. With -R, the change is made recursively through the directory structure. The user must be the owner of the file, or else a super-user.

chown
Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Example: hadoop fs -chown -R cloudwick:supergroup /user/cloudwick/abc.txt
Changes the owner of files. With -R, the change is made recursively through the directory structure. The user must be a super-user.

Reading files from HDFS

ls
Usage: hadoop dfs -ls <args>
For a file, returns stat on the file in the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
For a directory, it returns the list of its direct children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Example: hadoop dfs -ls /user/cloudwick/

lsr
Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
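The same four operations can also be performed programmatically through Hadoop's FileSystem Java API. A minimal sketch, assuming the cluster configuration files are on the classpath; the class name HdfsNavigation is illustrative, and the paths simply reuse the examples above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsNavigation {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Like: hadoop dfs -mkdir /user/hadoop/dir1
        fs.mkdirs(new Path("/user/hadoop/dir1"));

        // Like: hadoop dfs -put /root/abc.txt /user/cloudwick/
        fs.copyFromLocalFile(new Path("/root/abc.txt"), new Path("/user/cloudwick/abc.txt"));

        // Like: hadoop fs -chmod 755 ... and hadoop fs -chown cloudwick:supergroup ...
        fs.setPermission(new Path("/user/cloudwick/abc.txt"), new FsPermission((short) 0755));
        fs.setOwner(new Path("/user/cloudwick/abc.txt"), "cloudwick", "supergroup");

        // Like: hadoop dfs -ls /user/cloudwick/
        for (FileStatus status : fs.listStatus(new Path("/user/cloudwick/"))) {
            System.out.println(status.getPath() + "\t" + status.getPermission());
        }
        fs.close();
    }
}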
LAB 2: Analyze Data with MapReduce - Word Count

Hands-On MapReduce Word Count Program:
This hands-on exercise demonstrates a simple MapReduce program that counts the occurrences of the words found in the input files. In this case we'll be using the movielens data loaded into HDFS in the previous Sqoop exercise.

1. For this word count application, you'll be modifying the classes located at the following path:
$ cd ~/cloudwick/developer/wordcount

2. The following are the files that you need to modify:
~/cloudwick/developer/wordcount/WCMapper.java
~/cloudwick/developer/wordcount/WCReducer.java
~/cloudwick/developer/wordcount/WCDriver.java
WCMapper.java is the mapper for the job, WCReducer.java is the reducer for the job, and WCDriver.java is the driver class for the job.

3. The Java files already have declarations in place; all you need to do is write the logic for the respective classes.

4. How word count works: for a given input path, word count takes all the files in that path, counts the occurrences of each word across all the files, and outputs two columns as below:

Word    Count
The     134
Is      456
...     ...

Mapper -> The mapper reads every line of a file and sends the selected information to the shuffle phase, which aggregates the data going to a specific reducer. Sample output from the mapper is as follows:

The     1
Is      1
For     1
The     1
...     ...

Reducer -> The reducer takes the input from the shuffle phase, partitioned specifically for that reducer, and returns a final value for each word as follows:

The     134
Is      456
For     34
...     ...

5. After writing the logic for the mapper and reducer, compile the code using the following command:
$ javac -classpath /usr/lib/hadoop-0.20/hadoop-core.jar *.java
The above command generates the class files required to build the jar.

6. Now create the jar from the compiled classes:
$ jar cvf wordcount.jar *.class

7. Now submit the MapReduce job to Hadoop with the movielens dataset as the input path:
$ hadoop jar wordcount.jar WCDriver movielens wordcountoutput
The above command kicks off a MapReduce job, and the output is written to the wordcountoutput directory in HDFS.

8. If the job completed successfully, check the output of the word count using the following command:
$ hadoop fs -cat wordcountoutput/part-* | less

9. Solution: The solution to the word count can be found at the bottom of this lab.

Using Eclipse as the Development Environment and generating the jar file:

1. Open Eclipse from the Desktop.
2. Expand the wordcount project on the left-hand side in the "Project Explorer".
3. You'll find the three classes: WCMapper.java, WCReducer.java, WCDriver.java.
4. Complete the classes, and to build the jar follow these steps:
• Right-click on the package and select "Export".
• Select "JAR file" under "Java" and click "Next".
• Select the "JAR file export destination" and click "Next".
• Click "Next", and in the "JAR Manifest Specification" window, browse and select the Main class as the driver class (see the note on running the exported jar below).
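If the driver class is set as the Main class in the jar manifest during the export (last step above), the exported jar can be run without naming the driver class on the command line; hadoop jar falls back to treating the first argument as the main class only when the manifest does not specify one. A usage sketch, assuming the jar was exported as wordcount.jar:

$ hadoop jar wordcount.jar movielens wordcountoutput

This is why step 7 above, which uses the jar built with plain javac and no manifest entry, names WCDriver explicitly.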
Solution:

WCMapper.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String s = value.toString();
        for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}

WCReducer.java

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        context.write(key, new IntWritable(wordCount));
    }
}

WCDriver.java

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf(
                    "Usage: %s [generic options] <input dir> <output dir>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(WCDriver.class);
        job.setJobName(this.getClass().getName());

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        if (job.waitForCompletion(true)) {
            return 0;
        }
        return 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.exit(exitCode);
    }
}
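Because WCDriver extends Configured, implements Tool, and is launched through ToolRunner, Hadoop's generic options can be passed ahead of the job arguments. A usage sketch (the property value here is illustrative, not part of the lab):

$ hadoop jar wordcount.jar WCDriver -D mapred.reduce.tasks=2 movielens wordcountoutput

This same mechanism is what lets the -jt local flag used in LAB 4 switch the job to the local job runner without any code change.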
LAB 3: Hadoop Streaming

Hands-On Hadoop Streaming:
This hands-on exercise demonstrates Hadoop Streaming using Python and the word count example. You can use Perl, Python, PHP, Ruby, or shell scripting as your language of choice for developing a streaming solution to a problem. C++ is also supported, using Hadoop Pipes.

10. To develop a Hadoop Streaming application you need to write code for a mapper and a reducer.

Mapper: Design the mapper program so that it reads the input from STDIN, splits each line into words, and outputs a list of lines mapping words to their (intermediate) counts on STDOUT. The mapper does not compute an intermediate sum of a word's occurrences; instead it outputs "<word> 1". The output should be in the format "key <tab> value <newline>".

Reducer: Design the reducer program so that it reads the results of the mapper from STDIN in the form "key <tab> value <newline>", sums the occurrences of each word to a final count, and outputs its results to STDOUT. Again, the output should be in the format "key <tab> value <newline>".

11. Go to the location:
$ cd ~/cloudwick/developer/streaming

12. Sample Python "mapper.py" program for the word count example:

#!/usr/bin/env python
import sys

# get all lines from stdin
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # output each word as a tuple [word, 1]
    # output format: "key <tab> value <newline>"
    for word in words:
        print '%s\t%s' % (word, 1)

13. Sample Python "reducer.py" program for the word count example:

#!/usr/bin/env python
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

14. Test your code locally (here the mapper.py source file itself is used as sample input):
$ cat mapper.py | ~/cloudwick/developer/streaming/mapper.py
$ cat mapper.py | ~/cloudwick/developer/streaming/mapper.py | sort -k1,1 | ~/cloudwick/developer/streaming/reducer.py

15. Run the streaming job on Hadoop:
$ hadoop jar /usr/lib/hadoop-0.20/contrib/hadoop*streaming*.jar -file ~/cloudwick/developer/streaming/mapper.py -mapper ~/cloudwick/developer/streaming/mapper.py -file ~/cloudwick/developer/streaming/reducer.py -reducer ~/cloudwick/developer/streaming/reducer.py -input movielens -output streamingoutput

16. Check the output:
$ hadoop fs -cat streamingoutput/part-* | less

LAB 4: MapReduce Using Local Job Runner

Hands-On MapReduce LocalJobRunner:
This hands-on exercise demonstrates the use of MapReduce's LocalJobRunner, which allows you to perform integration tests. LocalJobRunner lets you test all aspects of a MapReduce job, including the reading and writing of data to and from the file system being used. It runs the program entirely in the client process (on a single machine, in a single JVM) instead of running it in distributed mode. This method is efficient for testing MapReduce on smaller sets of data.

Background: Hadoop comes bundled with the LocalJobRunner class, which Hadoop and its related projects can take advantage of for testing user code. A configuration setting enables the local job runner: normally, mapred.job.tracker is a host:port pair that specifies the JobTracker address, but when this value is set to local, the job is run in-process in a single JVM on the local machine without invoking an external JobTracker. LocalJobRunner not only lets you test how your code behaves within MapReduce; it also lets you test InputFormats, OutputFormats, and several other factors.

17. Run the previous word count job using the local job runner:
$ hadoop jar wordcount.jar WCDriver -jt local movielens moviewcoutput

NOTE: For LocalJobRunner to work, the MapReduce driver program should implement "Tool" and use the "ToolRunner" class.

NOTE: LocalJobRunner cannot run more than 1 reducer, but it does support the zero-reducer case. Also, LocalJobRunner does not support the DistributedCache.
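The same local-mode switch can be made from code, which is handy when driving an integration test from a Java program rather than the command line. A minimal sketch, assuming the WCDriver class from LAB 2 is on the classpath; the class name and the /tmp paths below are illustrative, not part of the lab:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class WCLocalRunnerTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Setting mapred.job.tracker to "local" selects LocalJobRunner:
        // the job runs in this JVM, with no external JobTracker.
        conf.set("mapred.job.tracker", "local");
        // Read and write the local file system instead of HDFS.
        conf.set("fs.default.name", "file:///");
        // Hypothetical local input/output paths holding a small sample of the data.
        int exitCode = ToolRunner.run(conf, new WCDriver(),
                new String[] { "/tmp/wc-input", "/tmp/wc-output" });
        System.exit(exitCode);
    }
}

Because this goes through ToolRunner with an explicit Configuration, it exercises the same code path as passing -jt local in step 17, just without the shell.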
LAB 5: MapReduce Custom Partitioner

Hands-On MapReduce Custom Partitioner:
This hands-on exercise demonstrates a simple MapReduce program with a custom partitioner on the "movielens" dataset.

Scenario: The data in the "movie" table is of the form:

Id    Name         Year
1     Toy Story    1995
2     Jumanji      1995
..    ..           ..

You'll have to find the number of movies released in each year, using a custom partitioner and the following year ranges:

Year range      Reducer
0-1950          1
1950-1980       2
1980-current    3

You should also tell the MapReduce program to use 3 reducers for this job.

18. Load the data from MySQL to the local filesystem:
$ mysql --user=training --password=training movielens -e "select * from movie into outfile '/tmp/movie.tsv' lines terminated by '\n'"

19. Load the data from the local filesystem to HDFS:
$ hadoop fs -mkdir custompartitioner
$ hadoop fs -put /tmp/movie.tsv custompartitioner

20. Check that the data is loaded properly into HDFS:
$ hadoop fs -lsr custompartitioner

21. Mapper: Design the mapper so that it writes out the year as the key and the movie name as the value:

1950    Toy Story
1950    Jumanji
1971    Get Carter
...     ...
2000    Tigerland

22. Partitioner: Design the partitioner so that years below 1950 are sent to the first reducer, years between 1950 and 1980 are sent to the second reducer, and years from 1980 onward are sent to the third reducer.

23. Reducer: Design the reducer so that it counts the number of movies coming in for each year and prints out the year followed by the number of movies released in that year, as follows:

1986    97
1987    67
1999    265
...     ...
2000    153
24. Solution:

WCMapper.java

package com.mapreduce.movielens.custompartitioner;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split("\t");
        String name = tokens[1];
        String year = tokens[2];
        if (name.length() > 0 && year.length() > 0) {
            context.write(new Text(year), new Text(name));
        }
    }
}

YearPartitioner.java

package com.mapreduce.movielens.custompartitioner;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class YearPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        int year = Integer.parseInt(key.toString());
        if (numReduceTasks == 0)
            return 0;
        if (year < 1950) {
            return 0;
        }
        if (year >= 1950 && year < 1980) {
            return 1 % numReduceTasks;
        } else {
            return 2 % numReduceTasks;
        }
    }
}

WCReducer.java

package com.mapreduce.movielens.custompartitioner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int wordCount = 0;
        for (@SuppressWarnings("unused") Text value : values) {
            wordCount += 1;
        }
        context.write(key, new IntWritable(wordCount));
    }
}

WCDriver.java

package com.mapreduce.movielens.custompartitioner;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.ha...
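The WCDriver.java listing ends abruptly above. As a sketch only of the wiring such a driver needs, modeled on the LAB 2 WCDriver rather than the original solution file (the class name YearCountDriver is hypothetical), the key additions over LAB 2 are setPartitionerClass and the three reducers required by the scenario:

package com.mapreduce.movielens.custompartitioner;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver sketch for LAB 5, not the original WCDriver.java.
public class YearCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }

        Job job = new Job(getConf());
        job.setJarByClass(YearCountDriver.class);
        job.setJobName(getClass().getName());

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);

        // Route each year to one of the three reducers defined in the scenario.
        job.setPartitionerClass(YearPartitioner.class);
        job.setNumReduceTasks(3);

        // The mapper emits (year, movie name); the reducer emits (year, movie count).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new YearCountDriver(), args));
    }
}

The job can then be packaged and submitted with hadoop jar exactly as in LAB 2, with the custompartitioner directory from step 19 as the input path.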