CS411 - MapReduce - Note 1 - 2

Bad 2 a 1 ending 1 bad 1 a 1 news



be x, y coordinates and the value could be image data. Key-value pairs can be very simple and flexible.

2. ETL tasks and data mining can be done with MapReduce because you simply provide a map function and a reduce function and write your logic inside the framework. Data loading: you don't have to worry about schema, BCNF, etc., so it is a lot easier.

Back to Parallel Databases (2 of 2)
Scenarios where MapReduce outperforms Distributed Databases
• Scenario 4: Limited budget and robustness.
• Most MapReduce projects are open source and free.
• MapReduce supports mid-query fault tolerance.
• If a node fails, the query does not need to be restarted.
• DBMSs typically don't support it.
• Only important as the number of nodes increases
• 1 failure/month, 1 hour/query
• Pr(mid_query_failure|10 nodes) = 1%
• Pr(mid_query_failure|100 nodes) = 13%
• Pr(mid_query_failure|1000 nodes) = 75%

4. For robustness. Example: suppose word counting is the query. It reads the documents, counts the words, and sums up the counts, so it can be a long query. A computer may crash while running it, or some nodes might not function well. With MapReduce, your partial results will still be available to you. If you run the same job in a database environment and the whole thing does not go through, then the whole thing is erased: you can't get the intermediate results, and you lose that part of the result. That is not good for this kind of workflow.
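The mid-query failure percentages on the Scenario 4 slide can be checked with a short calculation. The sketch below is only one plausible reading of the slide: it assumes each node fails independently, that "1 failure/month" means one failure per node per 30-day month (720 hours), and that a query runs for 1 hour. The function name and the 720-hour figure are illustrative assumptions, not something the slide states.

```python
def mid_query_failure_probability(nodes, query_hours=1.0, hours_between_failures=720.0):
    """Probability that at least one of `nodes` machines fails during one query.

    Assumptions (not stated on the slide): failures are independent, and each
    node fails during a given query with probability
    query_hours / hours_between_failures, where a month is taken as 720 hours.
    """
    per_node = query_hours / hours_between_failures
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - per_node) ** nodes


for n in (10, 100, 1000):
    print(f"{n:>5} nodes: {mid_query_failure_probability(n):.1%}")
```

Under these assumptions the results come out at roughly 1.4%, 13%, and 75% for 10, 100, and 1000 nodes, which is in line with the slide's figures and shows why mid-query fault tolerance only starts to matter as the cluster grows.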
=> When running on an unreliable cluster with a long query, the chance of a hardware failure is significant, and the MapReduce model lets us pick up the partial results.

Performance Comparison (0 of 7)

Two underlying forces that make MapReduce a buzzword:
- Data is everywhere, and most of it is non-relational
- Clusters are everywhere

Performance Comparison (1 of 7)
Benchmark (Madden 2009)
• Goals
• Understand efficiency differences between MapReduce and distributed databases
• Software
• MapReduce: Hadoop
• Distributed Databases: DBMS-X and Vertica
• Ran on 100 node Linux cluster
• Dataset: Grep (used in original MapReduce paper, 1TB of data)

Benchmark comparison between MapReduce (Hadoop) and the distributed databases DBMS-X and Vertica.
Extra resources on Vertica:
http://en.wikipedia.org/wiki/Vertica
http://www.vertica.com/
http://en.wikipedia.org/wiki/Michael_Stonebraker

Performance Comparison (2 of 7)
Hadoop wins on setup times significantly.
[Chart: Load Times. Y axis: Time (seconds); X axis: 25*40GB, 50*20GB, 100*10GB; series: Hadoop, Vertica, DBMS-X]
Databases don't scale linearly; Hadoop does

On query time, Vertica is the winner and Hadoop is the loser.
[Chart: Query Time. Y axis: Time (seconds); X axis: 25*40GB, 50*20GB, 100*10GB; series: Hadoop, Vertica, DBMS-X]
Vertica...

This note was uploaded on 01/28/2014 for the course CS 411 taught by Professor Staff during the Fall '08 term at University of Illinois, Urbana Champaign.
