15 Pages

intro_to_mapreduce

Course: CPS 216, Summer 2009
School: Duke

Word Count: 212



Unformatted Document Excerpt

CPS216: Advanced Database Systems (Data-intensive Computing Systems)
Introduction to MapReduce and Hadoop
Shivnath Babu

Word Count over a Given Set of Web Pages
Input pages: "see bob throw" and "see spot run"
Words as they occur, each counted once: see 1, bob 1, throw 1, see 1, spot 1, run 1
Final counts: bob 1, run 1, see 2, spot 1, throw 1
Can we do word count in parallel?

The MapReduce Framework (pioneered by Google)

Automatic Parallel Execution in MapReduce (Google)
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task so that one slow task does not slow down the whole job.

MapReduce in Hadoop (1), (2), (3)

Data Flow in a MapReduce Program in Hadoop
InputFormat -> Map function -> Partitioner -> Sorting & Merging -> Combiner -> Shuffling -> Merging -> Reduce function -> OutputFormat
(diagram label: 1:many)

Lifecycle of a MapReduce Job
Write a Map function and a Reduce function, then run this program as a MapReduce job.
Timeline: Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

Job Configuration Parameters
190+ parameters in Hadoop
Set manually or defaults are used

How to sort data using Hadoop?
Let us look at a complete example MapReduce program in Hadoop
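
The word-count example from the slides can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop code: the function names (map_fn, shuffle, reduce_fn, word_count) are hypothetical and chosen only for this sketch, and it assumes each page is a whitespace-separated string.

```python
from collections import defaultdict
from itertools import chain


def map_fn(page):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]


def shuffle(pairs):
    """Shuffle step: group all values by key and return groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return (word, sum(counts))


def word_count(pages):
    """Run map over every page, shuffle the pairs, then reduce each group."""
    # In a real framework each map_fn call could run on a different node;
    # here the calls simply run one after another (an assumption of this sketch).
    mapped = chain.from_iterable(map_fn(page) for page in pages)
    return [reduce_fn(word, counts) for word, counts in shuffle(mapped)]


pages = ["see bob throw", "see spot run"]
result = word_count(pages)
print(result)
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```

Because map_fn looks at one page at a time and reduce_fn looks at one word at a time, both steps can be spread across machines, which is what makes the parallel execution asked about in the slides possible.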