
CS 345A Data Mining MapReduce
Single-node architecture
[Diagram: a single machine with CPU, Memory, and Disk. Labels: Machine Learning, Statistics; “Classical” Data Mining.]

Commodity Clusters
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server (why?)
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
- How to organize computations on this architecture?
  - Mask issues such as hardware failure

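One way to answer the "why?" above is a back-of-the-envelope scan time. The sketch below takes the 100 TB figure from the slide; the 100 MB/s sequential disk throughput and the 1000-node cluster size are assumptions chosen for illustration, not numbers from the lecture.

```python
def scan_time_seconds(dataset_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to read a dataset once, end to end, at a sustained throughput."""
    return dataset_bytes / throughput_bytes_per_s


TB = 10**12
MB = 10**6

# Assumption (not from the slides): one disk sustains ~100 MB/s sequential reads.
ASSUMED_DISK_THROUGHPUT = 100 * MB
# Assumption (not from the slides): 1000 nodes, each scanning an equal share.
ASSUMED_NODES = 1000

single_disk = scan_time_seconds(100 * TB, ASSUMED_DISK_THROUGHPUT)
parallel = single_disk / ASSUMED_NODES

print(f"single disk: {single_disk / 86400:.1f} days")
print(f"{ASSUMED_NODES} disks in parallel: {parallel / 60:.1f} minutes")
```

Under these assumptions a single pass over the data takes days on one disk and minutes on a cluster, which is the motivation for spreading both storage and computation across many nodes.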
Cluster Architecture
[Diagram: nodes (each with CPU, Mem, and Disk) attached to rack switches, which are in turn connected by a further switch.]
- Each rack contains 16-64 nodes
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks

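The link speeds above translate directly into data-movement costs. The following sketch computes how long it takes to move 1 TB at the slide's 1 Gbps in-rack rate, and what share of a 10 Gbps backbone each node gets if all 16 nodes of a rack send at once. Equal sharing of the backbone and the absence of protocol overhead are simplifying assumptions.

```python
def transfer_time_seconds(num_bytes: float, link_gbps: float) -> float:
    """Ideal time to push num_bytes through a link of link_gbps gigabits/s."""
    bits = num_bytes * 8
    return bits / (link_gbps * 1e9)


TB = 10**12

# 1 TB between two nodes in the same rack (1 Gbps, from the slide).
within_rack = transfer_time_seconds(1 * TB, 1.0)

# Assumption: 16 nodes share a 10 Gbps backbone equally when all send at once.
per_node_backbone_gbps = 10.0 / 16
across_racks = transfer_time_seconds(1 * TB, per_node_backbone_gbps)

print(f"1 TB within a rack: {within_rack / 3600:.1f} hours")
print(f"1 TB across racks (shared backbone): {across_racks / 3600:.1f} hours")
```

Because moving data is this slow relative to reading it locally, it usually pays to bring the computation to the node that already holds the data.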
Stable storage
- First order problem: if nodes can fail, how can we store data persistently?
- Answer: Distributed File System
  - Provides global file namespace
  - Google GFS; Hadoop HDFS; Kosmix KFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

Distributed File System
- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64 MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in HDFS
  - Stores metadata
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data

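The pieces listed above can be sketched as a toy in-memory model. This is not the GFS or HDFS implementation; the names `split_into_chunks`, `place_replicas`, and `Master`, the round-robin placement policy, and the example cluster are all hypothetical and chosen only to show chunking, rack-aware replication, and a metadata lookup.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the range on the slide
REPLICATION = 3                # "usually 2x or 3x"


def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) for each contiguous chunk of a file."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]


def place_replicas(chunk_index: int,
                   servers_by_rack: Dict[str, List[str]],
                   replication: int = REPLICATION) -> List[str]:
    """Pick one chunk server from each of `replication` different racks.

    Hypothetical round-robin policy; real systems also weigh load and free space.
    """
    racks = sorted(servers_by_rack)
    if replication > len(racks):
        raise ValueError("not enough racks to keep every replica in a different rack")
    chosen = []
    for r in range(replication):
        rack = racks[(chunk_index + r) % len(racks)]
        servers = servers_by_rack[rack]
        chosen.append(servers[chunk_index % len(servers)])
    return chosen


class Master:
    """Stores metadata only: which chunk servers hold each chunk of each file."""

    def __init__(self, servers_by_rack: Dict[str, List[str]]) -> None:
        self.servers_by_rack = servers_by_rack
        self.chunk_locations: Dict[str, List[List[str]]] = {}

    def create_file(self, name: str, file_size: int) -> None:
        chunks = split_into_chunks(file_size)
        self.chunk_locations[name] = [
            place_replicas(i, self.servers_by_rack) for i in range(len(chunks))
        ]

    def locate(self, name: str, offset: int) -> List[str]:
        """What a client library would ask: who holds the chunk covering this byte?"""
        return self.chunk_locations[name][offset // CHUNK_SIZE]


# Hypothetical cluster: three racks with two chunk servers each.
cluster = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]}
master = Master(cluster)
master.create_file("crawl.dat", 200 * 1024 * 1024)   # 200 MB file
replicas = master.locate("crawl.dat", 130 * 1024 * 1024)
print(replicas)
```

After the lookup, a client would contact one of the returned chunk servers directly, so file data never flows through the master.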
Warm up: Word Count
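The body of this slide is not available in the preview, so the following is only one common way to phrase word count in map/reduce terms, run on a single machine. The function names (`map_words`, `group_by_key`, `reduce_counts`, `word_count`) and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in document.lower().split():
        yield word, 1


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group step: collect all values emitted for the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts for one word."""
    return word, sum(counts)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_words(doc))
    return dict(reduce_counts(w, c) for w, c in group_by_key(pairs).items())


counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

On a cluster, the map calls would run on the nodes holding each document's chunks and the grouping would happen over the network; here everything runs in one process to keep the structure visible.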