3mapreduce - CS 345A Data Mining MapReduce Singlenode...

    CS 345A Data Mining MapReduce
    Single-node architecture
      [Diagram: a single node with CPU, Memory, and Disk. Labels: Machine Learning, Statistics; “Classical” Data Mining]
    Commodity Clusters
      - Web data sets can be very large
          - Tens to hundreds of terabytes
      - Cannot mine on a single server (why?)
      - Standard architecture emerging:
          - Cluster of commodity Linux nodes
          - Gigabit Ethernet interconnect
      - How to organize computations on this architecture?
          - Mask issues such as hardware failure
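One way to approach the "why?" above is a back-of-envelope estimate of how long a single machine needs just to read the data once. The sketch below assumes roughly 100 MB/s of sequential read throughput per disk; that figure, the 1000-node cluster size, and the helper name `scan_time_hours` are illustrative assumptions, not values from the slides.

```python
# Back-of-envelope: time to read a data set once from disk.
# ASSUMPTION (not from the slides): ~100 MB/s sequential read per disk.

TB = 10**12  # bytes in a decimal terabyte
MB = 10**6   # bytes in a decimal megabyte


def scan_time_hours(dataset_bytes, read_bytes_per_sec, nodes=1):
    """Hours to read the data set once, split evenly across `nodes` disks."""
    seconds = dataset_bytes / (read_bytes_per_sec * nodes)
    return seconds / 3600


single = scan_time_hours(100 * TB, 100 * MB)               # one server
cluster = scan_time_hours(100 * TB, 100 * MB, nodes=1000)  # hypothetical cluster

print(f"1 node: {single:.1f} h, 1000 nodes: {cluster:.2f} h")
# prints "1 node: 277.8 h, 1000 nodes: 0.28 h"
```

Under these assumptions a single pass over 100 TB takes more than eleven days on one disk, before any computation happens, which is one plausible reading of why the work has to be spread over many nodes.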
    Cluster Architecture
      [Diagram: racks of nodes, each node with CPU, Mem, and Disk; the nodes in a rack connect to a rack switch, and the rack switches connect to a backbone switch]
      - Each rack contains 16-64 nodes
      - 1 Gbps between any pair of nodes in a rack
      - 2-10 Gbps backbone between racks
    Stable storage
      - First-order problem: if nodes can fail, how can we store data persistently?
      - Answer: Distributed File System
          - Provides global file namespace
          - Google GFS; Hadoop HDFS; Kosmix KFS
      - Typical usage pattern
          - Huge files (100s of GB to TB)
          - Data is rarely updated in place
          - Reads and appends are common
    Distributed File System
      - Chunk servers
          - File is split into contiguous chunks
          - Typically each chunk is 16-64 MB
          - Each chunk replicated (usually 2x or 3x)
          - Try to keep replicas in different racks
      - Master node
          - a.k.a. Name Node in HDFS
          - Stores metadata
          - Might be replicated
      - Client library for file access
          - Talks to master to find chunk servers
          - Connects directly to chunk servers to access data
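The division of labor described above (master holds metadata, chunk servers hold data, the client talks to both) can be sketched as a toy in-memory model. This is not the GFS or HDFS API: `ChunkServer`, `Master`, `client_write`, and `client_read` are hypothetical names, the placement rule is a simple rotation chosen for illustration, and the chunk size is shrunk to 64 bytes so the demo stays small.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64  # bytes, for the demo only; the slides cite 16-64 MB
REPLICAS = 3     # the slides say usually 2x or 3x


@dataclass
class ChunkServer:
    name: str
    rack: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes


class Master:
    """Stores metadata only: which chunks form a file and where replicas live."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.files = {}  # filename -> [(chunk_id, [ChunkServer, ...]), ...]

    def allocate(self, filename, index):
        # Rotate the start so chunks spread out, then take at most one server
        # per rack. With fewer racks than REPLICAS this yields fewer replicas.
        start = index % len(self.servers)
        rotated = self.servers[start:] + self.servers[:start]
        targets, racks = [], set()
        for server in rotated:
            if server.rack not in racks:
                targets.append(server)
                racks.add(server.rack)
            if len(targets) == REPLICAS:
                break
        chunk_id = f"{filename}#{index}"
        self.files.setdefault(filename, []).append((chunk_id, targets))
        return chunk_id, targets

    def locate(self, filename):
        return self.files[filename]


def client_write(master, filename, data):
    """Ask the master where each chunk goes, then send bytes to the servers."""
    for index, offset in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunk_id, targets = master.allocate(filename, index)
        for server in targets:
            server.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]


def client_read(master, filename):
    """Look up chunk locations, then read each chunk from any live replica."""
    parts = []
    for chunk_id, replicas in master.locate(filename):
        parts.append(next(s.chunks[chunk_id] for s in replicas if chunk_id in s.chunks))
    return b"".join(parts)


servers = [ChunkServer("a1", "A"), ChunkServer("a2", "A"),
           ChunkServer("b1", "B"), ChunkServer("c1", "C")]
master = Master(servers)
data = bytes(range(200))  # splits into 4 chunks of at most 64 bytes
client_write(master, "big.dat", data)

servers[0].chunks.clear()  # simulate losing one chunk server
print(client_read(master, "big.dat") == data)  # prints True
```

The file bytes never pass through `Master`; it only answers "where is chunk i?". That mirrors the slide's point that the client connects directly to chunk servers, and the read still succeeds after one server loses its data because every chunk has replicas in other racks.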
    Warm up: Word Count
      - We have a large file of words, one word to a line
      - Count the number of times each distinct word appears in the file
      - Sample application: analyze web server logs to find popular URLs
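The sample application is the same counting problem with URLs in place of words. A minimal sketch, assuming log lines in Common Log Format so that the requested URL is the seventh whitespace-separated field; the function name `popular_urls` and the sample lines are made up for illustration.

```python
from collections import Counter


def popular_urls(log_lines, top=3):
    """Return the `top` most requested URLs as (url, count) pairs.

    ASSUMPTION: Common Log Format, where the URL is field index 6, e.g.
    host - user [date zone] "GET /path HTTP/1.1" status bytes
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) > 6:  # skip malformed or truncated lines
            counts[fields[6]] += 1
    return counts.most_common(top)


sample = [
    '10.0.0.1 - - [17/Sep/2009:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [17/Sep/2009:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.1 - - [17/Sep/2009:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 512',
]
print(popular_urls(sample))  # prints [('/index.html', 2), ('/about.html', 1)]
```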
    Word Count (2)
      - Case 1: Entire file fits in memory
      - Case 2: File too large for mem, but all <word, count> pairs fit in mem
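The two cases differ only in what has to stay in memory at once. The sketch below is one possible reading, with hypothetical function names: the first version loads the whole file, the second reads one line at a time and keeps only the <word, count> table.

```python
import io
from collections import Counter


def count_in_memory(f):
    """Case 1: read the entire file into memory, then count."""
    words = f.read().split()
    return Counter(words)


def count_streaming(f):
    """Case 2: scan line by line; only the <word, count> table is kept."""
    counts = Counter()
    for line in f:  # file objects yield one line at a time
        word = line.strip()
        if word:
            counts[word] += 1
    return counts


text = "the\ncat\nthe\n"  # one word to a line, as in the problem statement
print(count_in_memory(io.StringIO(text)) == count_streaming(io.StringIO(text)))
# prints True
```

Both functions accept any open text file, so `count_streaming(open(path))` would work on a real file; `io.StringIO` is used here only to keep the example self-contained.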

This note was uploaded on 09/17/2009 for the course IT it771 taught by Professor Jenisha during the Fall '09 term at University of Advancing Technology.
