Cloudera_Data_Analyst_Training.pdf -...

Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
201410

© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Chapter 1: Introduction

Course Chapters
  • Course Introduction
    – Introduction
    – Hadoop Fundamentals
  • Data ETL and Analysis With Pig
    – Introduction to Pig
    – Basic Data Analysis with Pig
    – Processing Complex Data with Pig
    – Multi-Dataset Operations with Pig
    – Pig Troubleshooting and Optimization
  • Introduction to Impala and Hive
    – Introduction to Impala and Hive
    – Querying With Impala and Hive
    – Impala and Hive Data Management
    – Data Storage and Performance
  • Data Analysis With Impala and Hive
    – Relational Data Analysis With Impala and Hive
    – Working with Impala
    – Analyzing Text and Complex Data with Hive
    – Hive Optimization
    – Extending Hive
  • Course Conclusion
    – Choosing the Best Tool for the Job
    – Conclusion

Chapter Topics
  • About This Course
  • About Cloudera
  • Course Logistics
  • Introductions

Section: About This Course

Course Objectives (1)
During this course, you will learn
  • The purpose of Hadoop and its related tools
  • The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
  • How to identify typical use cases for large-scale data analysis
  • How to load data from relational databases and other sources
  • How to manage data in HDFS and export it for use with other systems
  • How Pig, Hive, and Impala improve productivity for typical analysis tasks
  • The language syntax and data formats supported by these tools

Course Objectives (2)
  • How to design and execute queries on data stored in HDFS
  • How to join diverse datasets to gain valuable business insight
  • How Hive and Impala can be extended with custom functions and scripts
  • How to analyze structured, semi-structured, and unstructured data
  • How to store and query data for better performance
  • How to determine which tool is the best choice for a given task

Section: About Cloudera

About Cloudera (1)
  • The leader in Apache Hadoop-based software and services
  • Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
  • Provides support, consulting, training, and certification for Hadoop users
  • Staff includes committers to virtually all Hadoop projects
  • Many authors of industry standard books on Apache Hadoop projects
    – Tom White, Lars George, Kathleen Ting, etc.

About Cloudera (2)
  • Customers include many key users of Hadoop
    – Allstate, AOL Advertising, Box, BT, CBS Interactive, eBay, Experian, FICO, Groupon, MasterCard, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
  • Cloudera public training:
    – Cloudera Developer Training for Apache Hadoop
    – Cloudera Developer Training for Apache Spark
    – Designing and Building Big Data Applications
    – Cloudera Administrator Training for Apache Hadoop
    – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
    – Cloudera Training for Apache HBase
    – Introduction to Data Science: Building Recommender Systems
    – Cloudera Essentials for Apache Hadoop
  • Onsite and custom training is also available

CDH
  • CDH (Cloudera's Distribution, including Apache Hadoop)
    – 100% open source, enterprise-ready distribution of Hadoop and related projects
    – The most complete, tested, and widely-deployed distribution of Hadoop
    – Integrates all key Hadoop ecosystem projects
    – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Express
  • Cloudera Express
    – Free download
  • The best way to get started with Hadoop
  • Includes CDH
  • Includes Cloudera Manager
    – End-to-end administration for Hadoop
    – Deploy, manage, and monitor your cluster

Cloudera Enterprise
  • Cloudera Enterprise
    – Subscription product including CDH and Cloudera Manager
  • Includes support
  • Includes extra Cloudera Manager features
    – Configuration history and rollbacks
    – Rolling updates
    – LDAP integration
    – SNMP support
    – Automated disaster recovery
    – Etc.

Section: Course Logistics

Logistics
  • Class start and finish times
  • Lunch
  • Breaks
  • Restrooms
  • Wi-Fi access
  • Virtual machines
  • Can I come in early/stay late?
Your instructor will give you details on how to access the course materials and exercise instructions for the class

Section: Introductions

Introductions
  • About your instructor
  • About you
    – Where do you work and what do you do there?
    – Which database(s) and platform(s) do you use?
    – Have you worked with Apache Hadoop or related tools?
    – Any experience as a developer?
    – What programming languages do you use?
    – What are your expectations for this course?

Chapter 2: Hadoop Fundamentals

Hadoop Fundamentals
In this chapter, you will learn
  • Which factors led to the era of Big Data
  • What Hadoop is and what significant features it offers
  • How Hadoop offers reliable storage for massive amounts of data with HDFS
  • How Hadoop supports large-scale data processing through MapReduce
  • How 'Hadoop Ecosystem' tools can boost an analyst's productivity
  • Several ways to integrate Hadoop into the modern data center

Chapter Topics
  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Pig, Hive, and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenario Explanation
  • Conclusion
  • Hands-On Exercise: Data Ingest with Hadoop Tools

Section: The Motivation for Hadoop

Velocity
  • We are generating data faster than ever
    – Processes are increasingly automated
    – Systems are increasingly interconnected
    – People are increasingly interacting online

Variety
  • We are producing a wide variety of data
    – Social network connections
    – Server and application log files
    – Electronic medical records
    – Images, audio, and video
    – RFID and wireless sensor network events
    – Product ratings on shopping and review Web sites
    – And much more…
  • Not all of this maps cleanly to the relational model

Volume
  • Every day…
    – More than 1.5 billion shares are traded on the New York Stock Exchange
    – Facebook stores 2.7 billion comments and 'Likes'
    – Google processes about 24 petabytes of data
  • Every minute…
    – Foursquare handles more than 2,000 check-ins
    – TransUnion makes nearly 70,000 updates to credit files
  • And every second…
    – Banks process more than 10,000 credit card transactions

Data Has Value
  • This data has many valuable applications
    – Product recommendations
    – Predicting demand
    – Marketing analysis
    – Fraud detection
    – And many, many more…
  • We must process it to extract that value
    – And processing all the data can yield more accurate results

We Need a System that Scales
  • We're generating too much data to process with traditional tools
  • Two key problems to address
    – How can we reliably store large amounts of data at a reasonable cost?
    – How can we analyze all the data we have stored?

Section: Hadoop Overview

What is Apache Hadoop?
  • Scalable and economical data storage and processing
    – Distributed and fault-tolerant
    – Harnesses the power of industry standard hardware
  • Heavily inspired by technical documents published by Google
[Diagram: the Hadoop stack. Applications: Batch Processing (MapReduce, Hive, Pig), Search Engine (Cloudera Search), Analytic SQL (Impala), Machine Learning (Spark, Mahout), Stream Processing (Spark), Other Applications. Workload Management (YARN). Data Storage: Filesystem (HDFS), Online NoSQL (HBase). Data Integration (Sqoop, Flume).]

Scalability
  • Hadoop is a distributed system
    – A collection of servers running Hadoop software is called a cluster
  • Individual servers within a cluster are called nodes
    – Typically standard rackmount servers running Linux
    – Each node both stores and processes data
  • Add more nodes to the cluster to increase scalability
    – A cluster may contain up to several thousand nodes
    – You can scale out incrementally as required

Fault Tolerance
  • Paradox: Adding nodes increases the chance that any one of them will fail
    – Solution: build redundancy into the system and handle it automatically
  • Files loaded into HDFS are replicated across nodes in the cluster
    – If a node fails, its data is re-replicated using one of the other copies
  • Data processing jobs are broken into individual tasks
    – Each task takes a small amount of data as input
    – Thousands of tasks (or more) often run in parallel
    – If a node fails during processing, its tasks are rescheduled elsewhere
  • Routine failures are handled automatically without any loss of data

Section: Data Storage: HDFS

HDFS: Hadoop Distributed File System
  • HDFS provides the storage layer for Hadoop data processing
  • Provides inexpensive and reliable storage for massive amounts of data
  • Other Hadoop components work with data in HDFS
    – MapReduce, Impala, Hive, Pig, Spark, etc.
[Diagram: the Hadoop stack, as above]

HDFS Features
  • Optimized for sequential access to a relatively small number of large files
    – Each file is likely to be 100MB or larger
    – Multi-gigabyte files are typical
  • In some ways, HDFS is similar to a UNIX filesystem
    – Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
    – UNIX-style file ownership and permissions
  • There are also some major deviations from UNIX
    – No concept of a current directory
    – Cannot modify files once written
    – Must use Hadoop-specific utilities or custom code to access HDFS

HDFS Architecture
  • Hadoop has a master/slave architecture
  • HDFS master daemon: NameNode
    – Manages namespace and metadata
    – Monitors slave nodes
  • HDFS slave daemon: DataNode
    – Reads and writes the actual data
[Diagram: a small Hadoop cluster, with a master node running the HDFS master daemon and slave nodes running the HDFS slave daemons]

Accessing HDFS via the Command Line
  • HDFS is not a general purpose filesystem
    – Not built into the OS, so only specialized tools can access it
    – End users typically access HDFS via the hdfs dfs command
  • Example: display the contents of the /user/fred/sales.txt file
    $ hdfs dfs -cat /user/fred/sales.txt
  • Example: Create a directory (below the root) called reports
    $ hdfs dfs -mkdir /reports

Copying Local Data To and From HDFS
  • Remember that HDFS is distinct from your local filesystem
    – Use hdfs dfs -put to copy local files to HDFS
    – Use hdfs dfs -get to fetch a local copy of a file from HDFS
[Diagram: a client machine running "$ hdfs dfs -put file" to copy a file into the Hadoop cluster and "$ hdfs dfs -get file" to copy it back]

More hdfs dfs Command Examples
  • Copy file input.txt from local disk to the user's directory in HDFS
    $ hdfs dfs -put input.txt input.txt
    – This will copy the file to /user/username/input.txt
  • Get a directory listing of the HDFS root directory
    $ hdfs dfs -ls /
  • Delete the file /reports/sales.txt
    $ hdfs dfs -rm /reports/sales.txt

Using the Hue HDFS File Manager
  • Hue is a Web interface for Hadoop
    – Hadoop User Experience
  • Hue includes an application for browsing and managing files in HDFS
    – To use Hue, browse to [URL not shown in this preview]
[Screenshot: the Hue file manager, with Manage Files, Upload Files, and Browse Files callouts]

Section: Distributed Data Processing: YARN, MapReduce, and Spark

Workload Management: YARN
  • Many Hadoop tools work with data in a Hadoop cluster
  • Requires workload management to distribute and monitor work across the cluster
[Diagram: the Hadoop stack, with the Workload Management layer labeled "YARN or MapReduce 1"]

Hadoop Cluster Architecture
  • Master/Slave Architecture
    – YARN or MapReduce version 1
    – Details differ slightly
  • Master nodes
    – Run master daemons to accept jobs, and monitor and distribute work
  • Slave nodes
    – Run slave daemons to start tasks
    – Do the actual work
    – Report status back to master daemons
  • HDFS and YARN/MRv1 are collocated
    – Slave nodes run both HDFS and slave daemons on the same machines
[Diagram: a small Hadoop cluster, with a master node running the YARN and HDFS master daemons and slave nodes running the YARN and HDFS slave daemons]

General Data Processing
  • Hadoop includes two general data processing engines
    – MapReduce
    – Spark
  • Both are programming libraries (Java, Scala, Python…)
[Diagram: the Hadoop stack, with the Workload Management layer labeled "YARN or MapReduce"]

Hadoop MapReduce
  • Hadoop MapReduce was the original processing engine for Hadoop
    – Still the most commonly used general data processing engine
  • Based on the 'map-reduce' programming model
    – A style of processing data popularized by Go...
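
The 'map-reduce' programming model named on the Hadoop MapReduce slide can be pictured with ordinary UNIX tools. The sketch below is only an analogy that runs on a single machine and does not use Hadoop or its APIs; the function name word_count and the sample input are made up for illustration. It counts words in three stages: a map step that emits one record per word, a sort that brings identical keys together (a stand-in for the shuffle between map and reduce), and a reduce step that aggregates each group.

```shell
#!/bin/sh
# Single-machine analogy of the map-reduce style using standard UNIX tools.
# This is NOT Hadoop code; word_count is a hypothetical helper name.

word_count() {
    tr -s '[:space:]' '\n' |    # map: emit one word (key) per line
    grep -v '^$' |              # drop empty records
    sort |                      # shuffle: bring identical keys together
    uniq -c |                   # reduce: count the records for each key
    awk '{ print $2 "\t" $1 }'  # format each result as word<TAB>count
}

# Sample input (made up): three short lines of tool names.
printf 'pig hive impala\nhive pig\npig\n' | word_count
```

Run as written, this should print one tab-separated line per distinct word with its count. In Hadoop the same three stages run as many parallel tasks spread across the nodes of the cluster rather than as one pipeline on one machine.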