CS411 - MapReduce - Note 1 - 2



Why Do We Learn This?

There are scenarios in which a parallel database is not sufficient. MapReduce and other NoSQL systems have become new ways to solve several problems that relational databases lack the capability to handle.

Scenario 1: Semi-Structured Data
• A lot of Web data is semi-structured, without a predefined schema.
• This is inconsistent with the relational model in an RDBMS.
Note: First, use key-value pairs in JSON, XML, or another semi-structured format. See the previous semi-structured database slides for a review.

Scenario 2: ETL (Extract, Transform, and Load) and "read once" Tasks
• Example: Web logs processing
  • Counting word distribution
  • Web logs aggregated by users
  • Trend analysis
  • Useful statistics
• It is unnecessary to store the data in a DBMS for querying.
Note: ETL (extract, transform, and load data) is a workflow that runs before the data reaches a SQL database. See http://en.wikipedia.org/wiki/Extract,_transform,_load

Scenario 3: Data Mining Applications
• Example: K-Means
  • Data
  • Initial assignment of clusters
  • Find cluster center
  • Reassign data
  • ...
• Could not be structured as single SQL queries.
Note: For data mining purposes, the operations are complex and go far beyond the Greek symbols of a relational algebra (RA) query.

Scenario 4: Limited Budget and Robustness
• Open source distributed database systems are not robust enough.
• Commercial distributed database systems are expensive.
Note: Traditional databases are really expensive. Long workflow: transaction semantics with no partial result.

All of these are shortcomings of SQL.

What is MapReduce?

• A programming model for large-scale distributed data processing
• History
  • The actual origins of MapReduce are arguable, but the most cited paper is "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat (2004).
  • Pioneer of MapReduce implementation: Hadoop Framework by Doug Cutting and …
  • Today, numerous independent people and organizations contribute to MapReduce Project.

MapReduce in Google
Note: MapReduce has a wide range of uses in Google's applications.
• "Googlers' Hammer for 80% of our Data crunching"
• Large-scale web search indexing
• Clustering problems for Google News
• Produce...
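To make the programming model concrete, here is a minimal single-process sketch of the "counting word distribution" task from Scenario 2. It is not Hadoop or Google code. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample log lines are assumptions made up for illustration, and the shuffle step, which a real framework performs across machines, is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (simulated in memory)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over all records in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the slides.
    logs = ["GET /index GET /about", "GET /index POST /login"]
    print(word_count(logs))
```

The programmer writes only the map and reduce functions. Because each record is mapped independently and each key is reduced independently, a framework can, in principle, spread both phases over many machines.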
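The K-Means steps in Scenario 3 (initial assignment, find cluster center, reassign data) can be sketched as a loop. The version below is a rough illustration on one-dimensional points, with function names and a max_rounds cap chosen here as assumptions. The number of rounds depends on the data, which is one way to see why the slides say it cannot be structured as a single SQL query.

```python
def assign(points, centers):
    """Assign each point to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        for p in points
    ]


def recompute(points, labels, centers):
    """Move each center to the mean of its points; keep it if it has none."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        new_centers.append(sum(members) / len(members) if members else old)
    return new_centers


def k_means(points, centers, max_rounds=100):
    """Repeat assign/recompute until the centers stop changing."""
    for _ in range(max_rounds):
        labels = assign(points, centers)
        new_centers = recompute(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    # Hypothetical data and starting centers, chosen for illustration.
    print(k_means([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 12.0]))
```

In a MapReduce setting, one common arrangement is to run each round as one job, with the assignment step as the map and the center recomputation as the reduce. That mapping is an assumption of this sketch and is not stated in the slides.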

This note was uploaded on 01/28/2014 for the course CS 411 taught by Professor Staff during the Fall '08 term at University of Illinois, Urbana Champaign.
