L_wine.sp11 - 1 CS 525 Advanced Distributed Systems Spring...


1  CS 525 Advanced Distributed Systems, Spring 2011
   Indranil Gupta (Indy)
   Old Wine: Stale or Vintage?
   April 14, 2011
   All Slides IG

2  Wisdom vs. Whipper-snapper
   Plenty of wisdom has evolved for distributed systems design over the decades.
   It is important to revisit this wisdom every once in a while and test it.

3  A comparison of approaches to large-scale data analysis
   A. Pavlo et al., SIGMOD 2009
   (there is also a shorter paper in CACM, Jan 2010)

4  Databases vs. Clouds
   Basic question: Why use MapReduce (MR) when parallel databases have been around for decades and have been successful?
   The paper was written by experts in database research, including those who defined relational DBs many decades ago.
   Some in the database community felt they might be hurt by these new-fangled cloud computing paradigms, which seemed to be merely reinventing the wheel.

5  Relational Database
   Consists of schemas
      Employee schema, Company schema
   Schema = table consisting of tuples
      <Employee name, company>
      <Company name, number of employees>
   Tuple = consists of multiple fields, including primary key, foreign key, etc.
      <Employee name, company>
   SQL queries run on multiple schemas very efficiently

6  Parallel DB is similar to MR
   Parallel DBs = parallelize query operation across multiple servers
   Parallel DBs: data processing consists of 3 phases
   E.g., consider joining two tables T1 and T2:
      1. Filter T1 and T2 in parallel (~Map)
      2. Distribute the larger of T1 and T2 across multiple nodes, then broadcast the other Ti to all nodes (~Shuffle)
      3. Perform join at each node, and store back answers (~Reduce)
   Parallel DBs used in the paper: DBMS-X and Vertica
   MR representative: Hadoop

7  Advantages of Parallel DBs over MR
   Schema support: more structured storage of data
      Really needed?
   Indexing: e.g., B-trees
      Really needed? What about BigTable?
   Programming model: more expressive
      What about Hive and Pig Latin?
   Execution strategy: push data instead of pull (as in MR)
      They argue it reduces bottlenecks
      Really? What about the network I/O bottleneck?

8  Load Times
   In general best for Hadoop
   The cost of structured schema and indexing
   Data loaded on nodes sequentially! (implementation artifact?)
   Are these operations really needed?

9  Actual Execution
   Hadoop always seems worst
      Overhead rises linearly with data stored per node
   Vertica is best
      Aggressive compression
      More effective as more data stored per node

10 Other Tasks
   MR was never built for Join!
      What about the MapReduce-Merge system?
   UDF does worst because of row-by-row interaction between the DB and the input file, which is outside the DB
      How common is this in parallel DBs?

11 What have we learnt?...
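The three-phase parallel join on slide 6 (filter, distribute and broadcast, join per node) can be sketched roughly as below. This is a minimal single-process illustration, not how DBMS-X, Vertica, or Hadoop implement it: the function name `parallel_join`, the round-robin partitioning, the dict-based rows, and the sample Employee/Company data are all assumptions made for the example, and the "nodes" are simulated by a loop.

```python
from collections import defaultdict


def parallel_join(t1, t2, key, keep1=lambda r: True, keep2=lambda r: True,
                  num_nodes=3):
    """Simulate the slide's 3-phase join of T1 and T2 on a shared column `key`.

    Rows are plain dicts; `num_nodes` simulated nodes are processed in a loop
    (a hypothetical stand-in for real servers).
    """
    # Phase 1 (~Map): filter both tables.
    f1 = [r for r in t1 if keep1(r)]
    f2 = [r for r in t2 if keep2(r)]

    # Phase 2 (~Shuffle): partition the larger table across nodes
    # (round-robin here, an assumption) and broadcast the smaller one.
    t1_is_larger = len(f1) >= len(f2)
    large, small = (f1, f2) if t1_is_larger else (f2, f1)
    partitions = [large[i::num_nodes] for i in range(num_nodes)]

    # Phase 3 (~Reduce): each node joins its partition against its copy
    # of the broadcast table using a local hash index.
    results = []
    for part in partitions:
        index = defaultdict(list)
        for row in small:
            index[row[key]].append(row)
        for row in part:
            for match in index.get(row[key], []):
                left, right = (row, match) if t1_is_larger else (match, row)
                results.append({**left, **right})
    return results


# Hypothetical sample data modeled on the Employee and Company tuples.
employees = [
    {"employee": "Ann", "company": "Acme"},
    {"employee": "Bob", "company": "Initech"},
    {"employee": "Cy", "company": "Acme"},
]
companies = [
    {"company": "Acme", "num_employees": 120},
    {"company": "Initech", "num_employees": 40},
]

# Keep only companies with more than 50 employees, then join on "company".
joined = parallel_join(employees, companies, "company",
                       keep2=lambda r: r["num_employees"] > 50)
print(sorted(r["employee"] for r in joined))  # ['Ann', 'Cy']
```

Broadcasting the smaller table means no row of the larger table has to move after partitioning, which is one reason this plan resembles a map-side join in MR terms.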