mapreduce-1

CS 345A Data Mining MapReduce
Single-node architecture
[Diagram: a single machine with CPU, Memory, and Disk. Labels: Machine Learning, Statistics; "Classical" Data Mining]
Commodity Clusters
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
- How to organize computations on this architecture?
  - Mask issues such as hardware failure
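One way to approach the "why?" above is a back-of-envelope scan-time estimate. The sketch below assumes a sequential disk read rate of 100 MB/s per node; that figure and the 1,000-node cluster size are illustrative assumptions, not numbers from the slides.

```python
TB = 10**12  # bytes in a (decimal) terabyte

# Assumption for illustration only: sequential read rate of one disk.
ASSUMED_DISK_BYTES_PER_SEC = 100 * 10**6


def scan_time_hours(dataset_bytes, bytes_per_sec, nodes=1):
    """Hours needed to read the whole data set once, split evenly over `nodes`."""
    return dataset_bytes / (bytes_per_sec * nodes) / 3600


for size_tb in (10, 100):
    single = scan_time_hours(size_tb * TB, ASSUMED_DISK_BYTES_PER_SEC)
    # Hypothetical cluster size, chosen only to show the scaling.
    cluster = scan_time_hours(size_tb * TB, ASSUMED_DISK_BYTES_PER_SEC, nodes=1000)
    print(f"{size_tb} TB: {single:.1f} h on one node, {cluster * 60:.1f} min on 1000 nodes")
```

Under these assumptions, just reading the data once takes more than a day on one machine, before any mining is done, which is the motivation for spreading both storage and computation across many nodes.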
Cluster Architecture
[Diagram: racks of nodes, each node with CPU, Mem, and Disk; the nodes in a rack share a switch, and the rack switches connect to a backbone switch]
- Each rack contains 16-64 nodes
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
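To get a feel for these link speeds, the following sketch converts them into transfer times. The 1 TB payload is an arbitrary illustrative size, and the calculation ignores protocol overhead and contention, so the results are best-case figures.

```python
def transfer_seconds(num_bytes, link_gbps):
    """Best-case time to push `num_bytes` over a link of `link_gbps` gigabits/s."""
    bits = num_bytes * 8
    return bits / (link_gbps * 10**9)


ONE_TB = 10**12  # illustrative payload size, not from the slides

# Link speeds quoted on the slide: 1 Gbps in-rack, 2-10 Gbps backbone.
for label, gbps in (("in-rack, 1 Gbps", 1),
                    ("backbone, 2 Gbps", 2),
                    ("backbone, 10 Gbps", 10)):
    minutes = transfer_seconds(ONE_TB, gbps) / 60
    print(f"{label}: {minutes:.1f} min per TB")
```

Note that the backbone is shared by all nodes in the racks it connects, so the bandwidth available to any one pair of nodes in different racks can be well below the in-rack figure. This is one reason to keep computation close to the data it reads.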
Stable storage
- First-order problem: if nodes can fail, how can we store data persistently?
- Answer: Distributed File System
  - Provides global file namespace
  - Google GFS; Hadoop HDFS; Kosmix KFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common
Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. NameNode in HDFS
  - Stores metadata
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
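The chunking and replication scheme above can be sketched as a toy model. This is a simplified illustration, not the actual placement policy of GFS or HDFS: the function names, the round-robin rack choice, and the server names are all hypothetical. The `chunk_locations` dictionary stands in for the metadata the master node would hold.

```python
MiB = 2**20
CHUNK_SIZE = 64 * MiB   # upper end of the 16-64 MB range on the slide
REPLICATION = 3         # slide: usually 2x or 3x


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs for the contiguous chunks of a file."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(chunk_index, servers_by_rack, replication=REPLICATION):
    """Toy policy: put each replica on a server in a different rack."""
    racks = sorted(servers_by_rack)
    if replication > len(racks):
        raise ValueError("not enough racks to keep replicas apart")
    chosen = []
    for r in range(replication):
        rack = racks[(chunk_index + r) % len(racks)]   # rotate over racks
        servers = servers_by_rack[rack]
        chosen.append(servers[chunk_index % len(servers)])
    return chosen


# Hypothetical cluster layout: rack name -> chunk servers in that rack.
servers_by_rack = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
}

chunks = split_into_chunks(150 * MiB)

# Master-style metadata: chunk index -> chunk servers holding a replica.
chunk_locations = {i: place_replicas(i, servers_by_rack)
                   for i in range(len(chunks))}

for i, (offset, length) in enumerate(chunks):
    print(f"chunk {i}: offset={offset}, length={length}, on {chunk_locations[i]}")
```

A client in this model would first look up `chunk_locations` (the master's role) and then read the bytes directly from one of the listed chunk servers, so the master stays off the data path.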
Warm up: Word Count
- We have a large file of words, one word to a line
- Count the number of times each distinct word appears in the file
- Sample application: analyze web server logs to find popular URLs
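As a baseline before any distribution, the word count task can be sketched on a single machine. This minimal version assumes the table of distinct words and their counts fits in memory; the `word_count` helper and the sample input are made up for illustration.

```python
import io
from collections import Counter


def word_count(lines):
    """Count occurrences of each distinct word, given one word per line."""
    counts = Counter()
    for line in lines:          # streams the input; only the counts are held in memory
        word = line.strip()
        if word:                # skip blank lines
            counts[word] += 1
    return counts


# Tiny in-memory stand-in for the "large file of words".
sample = io.StringIO("the\ncat\nthe\ndog\nthe\n")
counts = word_count(sample)

for word, n in sorted(counts.items()):
    print(word, n)
```

For the web log application, the same function would apply once each log line has been reduced to its URL. With a real file, `word_count(open(path))` streams the lines without loading the whole file at once.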
Word Count (2)
- Case 1: Entire file fits in memory
- Case 2: File too large for mem, but all