Cloudera Data Analyst Training:
Using Pig, Hive, and Impala with Hadoop

© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
201410

Introduction
Chapter 1

Course Chapters
Course Introduction
• Introduction
• Hadoop Fundamentals

Data ETL and Analysis With Pig
• Introduction to Pig
• Basic Data Analysis with Pig
• Processing Complex Data with Pig
• Multi-Dataset Operations with Pig
• Pig Troubleshooting and Optimization

Introduction to Impala and Hive
• Introduction to Impala and Hive
• Querying With Impala and Hive
• Impala and Hive Data Management
• Data Storage and Performance

Data Analysis With Impala and Hive
• Relational Data Analysis With Impala and Hive
• Working with Impala
• Analyzing Text and Complex Data with Hive
• Hive Optimization
• Extending Hive

Course Conclusion
• Choosing the Best Tool for the Job
• Conclusion

Chapter Topics
Introduction (Course Introduction)
• About This Course
• About Cloudera
• Course Logistics
• Introductions

Course Objectives (1)
During this course, you will learn:
• The purpose of Hadoop and its related tools
• The features that Pig, Hive, and Impala offer for data acquisition, storage, and analysis
• How to identify typical use cases for large-scale data analysis
• How to load data from relational databases and other sources
• How to manage data in HDFS and export it for use with other systems
• How Pig, Hive, and Impala improve productivity for typical analysis tasks
• The language syntax and data formats supported by these tools

Course Objectives (2)
• How to design and execute queries on data stored in HDFS
• How to join diverse datasets to gain valuable business insight
• How Hive and Impala can be extended with custom functions and scripts
• How to analyze structured, semi-structured, and unstructured data
• How to store and query data for better performance
• How to determine which tool is the best choice for a given task

About Cloudera (1)
• The leader in Apache Hadoop-based software and services
• Founded by leading experts on Hadoop from Facebook, Yahoo, Google, and Oracle
• Provides support, consulting, training, and certification for Hadoop users
• Staff includes committers to virtually all Hadoop projects
• Many authors of industry standard books on Apache Hadoop projects
  – Tom White, Lars George, Kathleen Ting, etc.

About Cloudera (2)
• Customers include many key users of Hadoop
  – Allstate, AOL Advertising, Box, BT, CBS Interactive, eBay, Experian, FICO, Groupon, MasterCard, National Cancer Institute, Orbitz, Social Security Administration, Trend Micro, Trulia, US Army, …
• Cloudera public training:
  – Cloudera Developer Training for Apache Hadoop
  – Cloudera Developer Training for Apache Spark
  – Designing and Building Big Data Applications
  – Cloudera Administrator Training for Apache Hadoop
  – Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  – Cloudera Training for Apache HBase
  – Introduction to Data Science: Building Recommender Systems
  – Cloudera Essentials for Apache Hadoop
• Onsite and custom training is also available

CDH
• CDH (Cloudera’s Distribution, including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely deployed distribution of Hadoop
  – Integrates all key Hadoop ecosystem projects
  – Available as RPMs and Ubuntu/Debian/SuSE packages or as a tarball

Cloudera Express
• Cloudera Express
  – Free download
• The best way to get started with Hadoop
• Includes CDH
• Includes Cloudera Manager
  – End-to-end administration for Hadoop
  – Deploy, manage, and monitor your cluster

Cloudera Enterprise
• Cloudera Enterprise
  – Subscription product including CDH and Cloudera Manager
• Includes support
• Includes extra Cloudera Manager features
  – Configuration history and rollbacks
  – Rolling updates
  – LDAP integration
  – SNMP support
  – Automated disaster recovery
  – Etc.

Logistics
• Class start and finish times
• Lunch
• Breaks
• Restrooms
• Wi-Fi access
• Virtual machines
• Can I come in early/stay late?

Your instructor will give you details on how to access the course materials and exercise instructions for the class.

Introductions
• About your instructor
• About you
  – Where do you work and what do you do there?
  – Which database(s) and platform(s) do you use?
  – Have you worked with Apache Hadoop or related tools?
  – Any experience as a developer?
  – What programming languages do you use?
  – What are your expectations for this course?

Hadoop Fundamentals
Chapter 2
In this chapter, you will learn:
• Which factors led to the era of Big Data
• What Hadoop is and what significant features it offers
• How Hadoop offers reliable storage for massive amounts of data with HDFS
• How Hadoop supports large-scale data processing through MapReduce
• How ‘Hadoop Ecosystem’ tools can boost an analyst’s productivity
• Several ways to integrate Hadoop into the modern data center

Chapter Topics
Hadoop Fundamentals (Course Introduction)
• The Motivation for Hadoop
• Hadoop Overview
• Data Storage: HDFS
• Distributed Data Processing: YARN, MapReduce, and Spark
• Data Processing and Analysis: Pig, Hive, and Impala
• Database Integration: Sqoop
• Other Hadoop Data Tools
• Exercise Scenario Explanation
• Conclusion
• Hands-On Exercise: Data Ingest with Hadoop Tools

Velocity
• We are generating data faster than ever
  – Processes are increasingly automated
  – Systems are increasingly interconnected
  – People are increasingly interacting online

Variety
• We are producing a wide variety of data
  – Social network connections
  – Server and application log files
  – Electronic medical records
  – Images, audio, and video
  – RFID and wireless sensor network events
  – Product ratings on shopping and review Web sites
  – And much more…
• Not all of this maps cleanly to the relational model

Volume
• Every day…
  – More than 1.5 billion shares are traded on the New York Stock Exchange
  – Facebook stores 2.7 billion comments and ‘Likes’
  – Google processes about 24 petabytes of data
• Every minute…
  – Foursquare handles more than 2,000 check-ins
  – TransUnion makes nearly 70,000 updates to credit files
• And every second…
  – Banks process more than 10,000 credit card transactions

Data Has Value
• This data has many valuable applications
  – Product recommendations
  – Predicting demand
  – Marketing analysis
  – Fraud detection
  – And many, many more…
• We must process it to extract that value
  – And processing all the data can yield more accurate results

We Need a System that Scales
• We’re generating too much data to process with traditional tools
• Two key problems to address
  – How can we reliably store large amounts of data at a reasonable cost?
  – How can we analyze all the data we have stored?

What is Apache Hadoop?
• Scalable and economical data storage and processing
  – Distributed and fault-tolerant
  – Harnesses the power of industry standard hardware
• Heavily inspired by technical documents published by Google

[Diagram: the Hadoop stack, top to bottom]
  Batch Processing (MapReduce, Hive, Pig) | Search Engine (Cloudera Search) | Analytic SQL (Impala) | Machine Learning (Spark, Mahout) | Stream Processing (Spark) | Other Applications
  Workload Management (YARN)
  Data Storage: Filesystem (HDFS) | Online NoSQL (HBase)
  Data Integration (Sqoop, Flume)

Scalability
• Hadoop is a distributed system
  – A collection of servers running Hadoop software is called a cluster
• Individual servers within a cluster are called nodes
  – Typically standard rackmount servers running Linux
  – Each node both stores and processes data
• Add more nodes to the cluster to increase scalability
  – A cluster may contain up to several thousand nodes
  – You can scale out incrementally as required

Fault Tolerance
• Paradox: Adding nodes increases the chance that any one of them will fail
  – Solution: build redundancy into the system and handle failures automatically
• Files loaded into HDFS are replicated across nodes in the cluster
  – If a node fails, its data is re-replicated using one of the other copies
• Data processing jobs are broken into individual tasks
  – Each task takes a small amount of data as input
  – Thousands of tasks (or more) often run in parallel
  – If a node fails during processing, its tasks are rescheduled elsewhere
• Routine failures are handled automatically without any loss of data
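
The replication and re-replication behavior described on the Fault Tolerance slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: the node names, block IDs, the replication factor of 3, and the random placement are all chosen for illustration, and the functions place_replicas and fail_node are hypothetical helpers, not part of any Hadoop API. Real HDFS uses its own placement policy.

```python
import random

def place_replicas(blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (illustrative random placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in blocks}

def fail_node(placement, failed, nodes, replication=3, seed=1):
    """Drop a failed node and copy each affected block to another live node."""
    rng = random.Random(seed)
    live = [n for n in nodes if n != failed]
    repaired = {}
    for block, holders in placement.items():
        survivors = holders - {failed}
        if not survivors:
            raise RuntimeError(f"block {block} lost: no surviving replica")
        # Re-replicate from a surviving copy until the target count is restored.
        candidates = [n for n in live if n not in survivors]
        missing = replication - len(survivors)
        survivors |= set(rng.sample(candidates, min(missing, len(candidates))))
        repaired[block] = survivors
    return repaired

# Hypothetical five-node cluster holding eight blocks.
nodes = [f"node{i}" for i in range(1, 6)]
blocks = [f"blk_{i}" for i in range(8)]
placement = place_replicas(blocks, nodes)
after = fail_node(placement, "node2", nodes)
for block in blocks:
    print(block, sorted(after[block]))
```

After the simulated failure, every block is back to three copies and none of them lives on the failed node, which is the sense in which a routine failure causes no loss of data.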

HDFS: Hadoop Distributed File System
• HDFS provides the storage layer for Hadoop data processing
• Provides inexpensive and reliable storage for massive amounts of data
• Other Hadoop components work with data in HDFS
  – MapReduce, Impala, Hive, Pig, Spark, etc.

[Diagram: the Hadoop stack, repeated from the "What is Apache Hadoop?" slide]

HDFS Features
• Optimized for sequential access to a relatively small number of large files
  – Each file is likely to be 100MB or larger
  – Multi-gigabyte files are typical
• In some ways, HDFS is similar to a UNIX filesystem
  – Hierarchical, with UNIX-style paths (e.g., /sales/rpt/asia.txt)
  – UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
  – No concept of a current directory
  – Cannot modify files once written
  – Must use Hadoop-specific utilities or custom code to access HDFS

HDFS Architecture
• Hadoop has a master/slave architecture
• HDFS master daemon: NameNode
  – Manages namespace and metadata
  – Monitors slave nodes
• HDFS slave daemon: DataNode
  – Reads and writes the actual data

[Diagram: A Small Hadoop Cluster, showing a Master (HDFS master daemon) and Slaves (HDFS slave daemons)]

Accessing HDFS via the Command Line
• HDFS is not a general purpose filesystem
  – Not built into the OS, so only specialized tools can access it
  – End users typically access HDFS via the hdfs dfs command
• Example: display the contents of the /user/fred/sales.txt file
    $ hdfs dfs -cat /user/fred/sales.txt
• Example: Create a directory (below the root) called reports
    $ hdfs dfs -mkdir /reports

Copying Local Data To and From HDFS
• Remember that HDFS is distinct from your local filesystem
  – Use hdfs dfs -put to copy local files to HDFS
  – Use hdfs dfs -get to fetch a local copy of a file from HDFS

[Diagram: a Client Machine and the Hadoop Cluster. "$ hdfs dfs -put file" (also shown as "$ hadoop fs -put sales.txt /reports") copies a file to the cluster; "$ hdfs dfs -get file" (also shown as "$ hadoop fs -get /reports/sales.txt") fetches it back.]

More hdfs dfs Command Examples
• Copy file input.txt from local disk to the user’s directory in HDFS
    $ hdfs dfs -put input.txt input.txt
  – This will copy the file to /user/username/input.txt
• Get a directory listing of the HDFS root directory
    $ hdfs dfs -ls /
• Delete the file /reports/sales.txt
    $ hdfs dfs -rm /reports/sales.txt

Using the Hue HDFS File Manager
• Hue is a Web interface for Hadoop
  – Hadoop User Experience
• Hue includes an application for browsing and managing files in HDFS
  – To use Hue, browse to

[Screenshot labels: Manage Files, Upload Files, Browse Files]
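
To close the HDFS section, here is a sketch of the path rule from the hdfs dfs command examples: a relative HDFS path such as input.txt ends up under /user/username/, while an absolute path such as /reports/sales.txt is used as given. The function resolve_hdfs_path is a hypothetical helper written for this illustration, not a Hadoop utility, and the username fred is borrowed from the /user/fred/sales.txt example.

```python
import posixpath

def resolve_hdfs_path(path, username):
    """Resolve a path the way the slides describe it.

    Absolute paths are kept; relative paths are placed under /user/<username>.
    This mirrors the described behavior only; it does not contact HDFS.
    """
    if path.startswith("/"):
        return posixpath.normpath(path)
    return posixpath.normpath(posixpath.join("/user", username, path))

print(resolve_hdfs_path("input.txt", "fred"))           # /user/fred/input.txt
print(resolve_hdfs_path("/reports/sales.txt", "fred"))  # /reports/sales.txt
```

Because HDFS has no concept of a current directory, a fixed per-user base directory is what makes a bare file name like input.txt meaningful.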

Workload Management: YARN
• Many Hadoop tools work with data in a Hadoop cluster
• Requires workload management to distribute and monitor work across the cluster

[Diagram: the Hadoop stack, repeated, with the middle layer labeled "Workload Management (YARN or MapReduce 1)"]

Hadoop Cluster Architecture
• Master/Slave Architecture
  – YARN or MapReduce version 1
  – Details differ slightly
• Master nodes
  – Run master daemons to accept jobs, and monitor and distribute work
• Slave nodes
  – Run slave daemons to start tasks
  – Do the actual work
  – Report status back to master daemons
• HDFS and YARN/MRv1 are collocated
  – Slave nodes run both the HDFS and the YARN/MRv1 slave daemons on the same machines

[Diagram: A Small Hadoop Cluster, showing a Master (YARN master daemon, HDFS master daemon) and Slaves (YARN slave daemons, HDFS slave daemons)]

General Data Processing
• Hadoop includes two general data processing engines
  – MapReduce
  – Spark
• Both are programming libraries (Java, Scala, Python…)

[Diagram: the Hadoop stack, repeated, with the middle layer labeled "Workload Management (YARN or MapReduce)"]

Hadoop MapReduce
• Hadoop MapReduce was the original processing engine for Hadoop
  – Still the most commonly used general data processing engine
• Based on the ‘map-reduce’ programming model
  – A style of processing data popularized by Go...
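
The ‘map-reduce’ programming model named above can be sketched on a single machine with a word count. This is an illustration of the general model only, not Hadoop's MapReduce API: the function names map_phase, shuffle, and reduce_phase and the sample lines are assumptions made for this example, and a real job would distribute these steps across the nodes of a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input records.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Each call to map_phase depends on only one line and each call to reduce_phase on only one key, which is what allows the work to be split into many small independent tasks.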