CS411 - MapReduce - Note 1 - 2

Comparing MapReduce with parallel databases (DBMS-X and Vertica), continued


Vertica's compression works better than DBMS-X's.

Performance Comparison (3 of 7) Map Reduce (40 of 44)

Vertica wins and Hadoop loses in aggregation.

[Figure: bar chart of aggregation-task time in seconds (0 to 1400) for Hadoop, Vertica, and DBMS-X on three cluster configurations: 25 nodes * 40 GB, 50 nodes * 20 GB, 100 nodes * 10 GB.]

The DBMSs perform better because of:
1. No parsing overheads;
2. Compression.

Performance Comparison (4 of 7) Map Reduce (41 of 44)

Experimental Conclusion
• Databases are good for traditional database workloads.
• Hadoop load times are faster, which is very important for one-off processing tasks.
• Hadoop query and aggregation times are a lot slower, due to the parsing, compression, and indexing an RDBMS performs.
• There is no compelling reason to choose MapReduce over a database for a traditional database workload.
• MapReduce should be chosen only for unstructured data, complex processing, and workflows, but you might find it hard to program.
• There have been ongoing efforts to make the programming easier.

Performance Comparison (5 of 7) Map Reduce (42 of 44)

PigLatin: a high-level scripting language integrated with Hadoop to make programming easier. It abstracts away the map-reduce details and makes programming simpler.
• Check out this video: msbiacademy.com/?p=6541 (it's done by Kevin's friend at Yahoo!)

Here's a brief introduction of it: “One of the reasons to use Hadoop as part of your data warehouse strategy is to take advantage of its ability to process data in a distributed way: massively parallel processing, or MPP. Another is to leverage its ‘schema on read’ approach when processing unstructured data. In data warehousing terms, reading data from a source system is known as ETL, or ‘Extract/Transform/Load’. In MPP systems, it’s typically more efficient to transpose the T and L and use the ‘Extract/Load/Transform’ pattern. Why? Because this pattern allows data transformation to leverage the full breadth of distributed processing nodes, resulting in superior performance.
Pig, which implements the PigLatin data-flow language for Hadoop, is the most commonly used ELT technology in Hadoop clusters. In this introduction-level lesson we’ll look at a simple Pig script to process and transform IIS web logs. While this is shown on an HDInsight cluster within the Azure cloud platform, the process is identical for any Hadoop cluster.”

The Future of MapReduce and Parallel Databases: Friend or Foe?
• They have different target audiences, so they are neither friends nor foes.
• We need different tools because of the diversity of workloads.

Performance Comparison (6 of 7) Map Reduce (43 of 44)

Efforts Towards Integrating Parallel Databases and MapReduce
• Supporting SQL queries over the MapReduce framework: Hive, Pig, Scope, Dryad/LINQ.
• Supporting MapReduce functions in parallel databases: X. Su et al., "Oracle In-Database Hadoop: When MapReduce Meets RDBMS," SIGMOD 2012.

Performance Comparison (7 of 7) Map Reduce (44 of 44)
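To make the aggregation task above concrete, here is a minimal pure-Python sketch of the three MapReduce phases (map, shuffle, reduce) for a group-by-sum aggregation. The comma-separated record format and the sample data are invented for illustration; they are not the benchmark's actual schema.

```python
from collections import defaultdict

def map_phase(records):
    # Map: parse each raw line on the fly (no pre-built schema or index,
    # which is exactly the parsing overhead the DBMSs avoid) and emit
    # (group_key, value) pairs.
    for line in records:
        key, value = line.split(",")
        yield key, float(value)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group; here, a simple sum.
    return {key: sum(values) for key, values in groups.items()}

raw = ["a,1", "b,2", "a,3"]
result = reduce_phase(shuffle(map_phase(raw)))
# result == {"a": 4.0, "b": 2.0}
```

In a real Hadoop job the map and reduce functions run on many nodes in parallel and the shuffle moves data across the network; this sketch only shows the data flow on one machine.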
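To illustrate the "schema on read" idea from the quoted introduction, here is a small Python sketch (not actual Pig Latin): raw log lines are loaded untouched, and structure is imposed only at transform time. The IIS-like log format and field names are hypothetical; a real Pig script would express the same grouping with LOAD, GROUP, and FOREACH statements.

```python
# Hypothetical raw web-log lines, loaded as-is ("Extract/Load"):
raw_logs = [
    "2024-01-01 10:00:00 10.0.0.1 200",
    "2024-01-01 10:00:01 10.0.0.2 404",
    "2024-01-01 10:00:02 10.0.0.1 200",
]

def parse(line):
    # Schema on read: structure is imposed only when a line is processed,
    # not when it is loaded into the store.
    date, time, ip, status = line.split()
    return {"ip": ip, "status": int(status)}

# Transform step of Extract/Load/Transform: count requests per client IP,
# the kind of grouping a Pig GROUP ... FOREACH would express, and which
# the cluster could run in parallel across nodes.
counts = {}
for rec in (parse(line) for line in raw_logs):
    counts[rec["ip"]] = counts.get(rec["ip"], 0) + 1
# counts == {"10.0.0.1": 2, "10.0.0.2": 1}
```

The point of the ELT ordering is that the transform (parsing and grouping) runs on the full cluster rather than in a single extraction pipeline before loading.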