CPS216 Data-Intensive Computing Systems, Fall 2011
Assignment 6 Solutions

Problem 1

(I) [Diagram: logical plan -> physical plan -> MapReduce plan.]

(II)

clicks: { (123, cnn, 2, 12), (123, bbc, 2, 13), (123, yahoo, 3, ), (124, bbc, 1, 11), (124, yahoo, 1, ), (124, bbc, 2, 15), (125, bbc, 2, ), (125, google, 2, ) }

fltrd (shown for group 123): { (123, cnn, 2, 12), (123, bbc, 2, 13) }

byuser: { (123, {(123, cnn, 2, 12), (123, yahoo, 3, ), (123, bbc, 2, 13)}),
          (124, {(124, bbc, 1, 11), (124, yahoo, 1, ), (124, bbc, 2, 15)}),
          (125, {(125, bbc, 2, ), (125, google, 2, )}) }

uniqPages (shown for group 124): { bbc }

uniqLinks (shown for group 123): { 2 }

result: { (123, 2, 1), (124, 1, 2), (125, 0, 0) }

(III) No, they are not equivalent. For a group in which every record has a NULL viewedat, the first query outputs the group with uniqPages and uniqLinks counts of 0, but the second query's output does not contain this group at all, since its records were filtered out before grouping. This means that with the second query, the result in the example above would be just { (123, 2, 1), (124, 1, 2) }: the group for user 125 disappears.

Problem 2

(I) [Solution not preserved in the text extraction.]

(II) To call the UDF, the URLs have to be grouped together by category and passed to it as a parameter. Because the largest category contains 100 million URLs, the memory needed for that group is approximately 100,000,000 x 50 bytes = 5 GB.

(III) Since the m1.small nodes on Amazon EC2 have 1.7 GB of memory, far less than what is needed for a single key, the problems encountered could be either very poor performance (if the system starts thrashing, generating a large amount of I/O to disk) or a complete crash of the job.

(IV) This can be done as a single MapReduce job. The mapper processes the input (K1 = offset, V1 = line) and outputs K2 = category together with a composite V2 consisting of a string url, a double pagerank, and an int count initialized to 1. For every category, the combiner finds the URL with the maximum pagerank among the values it sees and counts how many URLs there are; it outputs one V2 = (url_with_max_pagerank, max_pagerank, count_of_urls) per K2. The reducer receives, for every category, a list of V2s and, like the combiner, computes the maximum pagerank and sums up the counts. The reducer also performs the filter, outputting K3 = category, V3 = (url, max_pagerank) only for the categories whose count is > 50,000,000. (A Java sketch of this job appears after Problem 5.)

Problem 3

Because the input data is the same for both operations, the mapper can emit each record twice, tagging one copy as part of the 'pageid' grouping and the other copy as part of the 'linkid' grouping. The reducer then interprets these tags and performs the appropriate computation for each record. Thus, both operations can be processed in a single MapReduce job. (A sketch of this tagging pattern appears after Problem 5, following the Problem 2 sketch.)

Problem 4

[Diagram: logical plan -> physical plan -> MapReduce plan.]

Problem 5

[Diagram: logical plan -> physical plan -> MapReduce plan.]
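As an illustration of the Problem 2 (IV) design, here is a minimal Hadoop (Java) sketch. It is not part of the original solution: the input format (one url,category,pagerank record per line), the class names, and the encoding of V2 as a tab-separated "url \t pagerank \t count" string are all assumptions made for the example; a custom Writable would be the more idiomatic encoding of the composite value.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopUrlPerBigCategory {

  // Mapper: K1 = offset, V1 = line; emits K2 = category,
  // V2 = "url \t pagerank \t 1" (count initialized to 1).
  public static class CategoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");   // assumed layout: url,category,pagerank
      ctx.write(new Text(f[1]), new Text(f[0] + "\t" + f[2] + "\t1"));
    }
  }

  // Combiner: per category, keep the locally best (max-pagerank) URL
  // and sum the counts, so each mapper ships only one value per category.
  public static class MaxCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text category, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String bestUrl = null;
      double bestRank = Double.NEGATIVE_INFINITY;
      long count = 0;
      for (Text v : values) {
        String[] f = v.toString().split("\t");
        double rank = Double.parseDouble(f[1]);
        if (rank > bestRank) { bestRank = rank; bestUrl = f[0]; }
        count += Long.parseLong(f[2]);
      }
      ctx.write(category, new Text(bestUrl + "\t" + bestRank + "\t" + count));
    }
  }

  // Reducer: same aggregation as the combiner, plus the final filter
  // that keeps only categories with more than 50 million URLs.
  public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text category, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String bestUrl = null;
      double bestRank = Double.NEGATIVE_INFINITY;
      long count = 0;
      for (Text v : values) {
        String[] f = v.toString().split("\t");
        double rank = Double.parseDouble(f[1]);
        if (rank > bestRank) { bestRank = rank; bestUrl = f[0]; }
        count += Long.parseLong(f[2]);
      }
      if (count > 50000000L) {                   // K3 = category, V3 = (url, max_pagerank)
        ctx.write(category, new Text(bestUrl + "\t" + bestRank));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "top url per big category");
    job.setJarByClass(TopUrlPerBigCategory.class);
    job.setMapperClass(CategoryMapper.class);
    job.setCombinerClass(MaxCombiner.class);
    job.setReducerClass(MaxReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner is safe here because both max and sum are associative and commutative; Hadoop may apply it zero, one, or several times, and the reducer produces the same answer in every case.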
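And a minimal sketch of the Problem 3 tagging trick, again with assumed names and an assumed record layout (userid,pageid,linkid,viewedat). The per-group computation is stubbed out as a simple record count, since the original text does not spell it out; a driver analogous to the one in the previous sketch would wire these classes into a job.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SharedScanJob {

  // Mapper: emits each input record twice, once keyed by pageid
  // (tag "P") and once keyed by linkid (tag "L").
  public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");   // assumed: userid,pageid,linkid,viewedat
      ctx.write(new Text("P:" + f[1]), line);    // copy for the pageid grouping
      ctx.write(new Text("L:" + f[2]), line);    // copy for the linkid grouping
    }
  }

  // Reducer: dispatches on the tag and runs the computation that
  // belongs to that grouping (stubbed here as a record count).
  public static class DispatchingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text taggedKey, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      long n = 0;
      for (Text r : records) {
        n++;                                     // stand-in for the real aggregate
      }
      if (taggedKey.toString().startsWith("P:")) {
        ctx.write(taggedKey, new Text("pageid-group size = " + n));
      } else {
        ctx.write(taggedKey, new Text("linkid-group size = " + n));
      }
    }
  }
}
```

Doubling every record doubles the shuffle volume, but that is usually cheaper than scanning the input twice in two separate jobs, which is the point of the solution.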